Can I use the previously calculated answer from apply(axis=1) within the current row's evaluation?
I have this df:
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randn(5, 3), columns=list('ABC'))
df['String_column'] = list('abcde')
df
A B C String_column
0 0.297925 -1.025012 1.307090 'a'
1 -1.527406 0.533451 -0.650252 'b'
2 -1.646425 0.738068 0.562747 'c'
3 -0.045872 0.088864 0.932650 'd'
4 -0.964226 0.542817 0.873731 'e'
and I'm trying to add, for each row, the value of the previous row multiplied by 0.5 to the current value, without touching the string column (e.g. row = row + row.shift(1) * 0.5).
This is the code I have so far:
def calc_by_previous_answer(row):
    # here I have only the current row, so I'm unable to get the previous one
    row = row * 0.5
    return row

# adding shift() here will not propagate the previous answer
df = df.apply(calc_by_previous_answer, axis=1)
df
Not easy, but possible: select the previous row's values with loc, and to restrict the operation to numeric columns use DataFrame.select_dtypes:
def calc_by_previous_answer(row):
    # here I have only the current row, so I'm unable to get the previous one
    # cannot select the previous row for the first row, because it does not exist
    if row.name > 0:
        row = df.loc[row.name - 1, c] * 0.5 + row
    # else:
    #     row = row * 0.5
    return row

c = df.select_dtypes(np.number).columns
df[c] = df[c].apply(calc_by_previous_answer, axis=1)
print (df)
A B C String_column
0 0.297925 -1.025012 1.307090 'a'
1 -1.378443 0.020945 0.003293 'b'
2 -2.410128 1.004794 0.237621 'c'
3 -0.869085 0.457898 1.214023 'd'
4 -0.987162 0.587249 1.340056 'e'
A solution without apply, using DataFrame.add:
c = df.select_dtypes(np.number).columns
df[c] = df[c].add(df[c].shift() * 0.5, fill_value=0)
print (df)
A B C String_column
0 0.297925 -1.025012 1.307090 'a'
1 -1.378443 0.020945 0.003293 'b'
2 -2.410128 1.004794 0.237621 'c'
3 -0.869085 0.457898 1.214023 'd'
4 -0.987162 0.587249 1.340056 'e'
EDIT:
c = df.select_dtypes(np.number).columns
for idx, row in df.iterrows():
    if row.name > 0:
        df.loc[idx, c] = df.loc[idx - 1, c] * 0.5 + df.loc[idx, c]
print (df)
A B C String_column
0 0.297925 -1.025012 1.307090 'a'
1 -1.378443 0.020945 0.003293 'b'
2 -2.335647 0.748541 0.564393 'c'
3 -1.213695 0.463134 1.214847 'd'
4 -1.571074 0.774384 1.481154 'e'
There is no need to use apply; you can solve it as follows. Since you want to use the updated row value in the calculation of the following row's value, you need a for loop.
cols = ['A','B','C']
for i in range(1, len(df)):
    df.loc[i, cols] = df.loc[i-1, cols] * 0.5 + df.loc[i, cols]
Result:
A B C String_column
0 0.297925 -1.025012 1.307090 'a'
1 -1.378443 0.020945 0.003293 'b'
2 -2.335647 0.748541 0.564393 'c'
3 -1.213695 0.463134 1.214847 'd'
4 -1.571074 0.774384 1.481154 'e'
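If SciPy is available (an extra assumption; it is not used elsewhere on this page), the recurrence computed by the loop solutions above, y[i] = x[i] + 0.5 * y[i-1], is a first-order linear filter, so it can also be evaluated without an explicit Python loop via scipy.signal.lfilter. A sketch on toy data:

```python
import numpy as np
import pandas as pd
from scipy.signal import lfilter  # assumption: SciPy is installed

# Toy numeric data; in the question this would be the A/B/C columns.
df = pd.DataFrame({'A': [1.0, 2.0, 3.0], 'B': [4.0, 5.0, 6.0]})
cols = ['A', 'B']

# y[i] = x[i] + 0.5 * y[i-1] is a first-order IIR filter:
# numerator b = [1], denominator a = [1, -0.5].
df[cols] = lfilter([1.0], [1.0, -0.5], df[cols].to_numpy(), axis=0)
print(df['A'].tolist(), df['B'].tolist())
# [1.0, 2.5, 4.25] [4.0, 7.0, 9.5]
```

Note this matches the loop answers (which feed the already-updated previous value back in), not the shift-based one-pass solution.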
I am trying to apply the following function for each row in a dataframe. The dataframe looks as follows:
vote_1 vote_2 vote_3 vote_4
a a a b
b b a b
b a a b
I am trying to generate a fifth column to sum the 'votes' of the other columns and produce the winner, as follows:
vote_1 vote_2 vote_3 vote_4 winner_columns
a a a b a
b b a b b
b a a b draw
I have currently tried:
def winner(x):
    a = new_df.iloc[x].value_counts()['a']
    b = new_df.iloc[x].value_counts()['b']
    if a > b:
        y = 'a'
    elif a < b:
        y = 'b'
    else:
        y = 'draw'
    return y

df['winner_columns'].apply(winner)
However, the whole column gets filled with draws. I assume it is something to do with the way I have built the function, but I can't figure out what.
You can use DataFrame.mode and count the non-missing values with DataFrame.count; if there is only one mode, take the first column, else 'draw', via numpy.where:
df1 = df.mode(axis=1)
print (df1)
0 1
0 a NaN
1 b NaN
2 a b
df['winner_columns'] = np.where(df1.count(axis=1).eq(1), df1[0], 'draw')
print (df)
vote_1 vote_2 vote_3 vote_4 winner_columns
0 a a a b a
1 b b a b b
2 b a a b draw
Your solution is possible with this change:
def winner(x):
    s = x.value_counts()
    a = s['a']
    b = s['b']
    if a > b:
        y = 'a'
    elif a < b:
        y = 'b'
    else:
        y = 'draw'
    return y

df['winner_columns'] = df.apply(winner, axis=1)
print (df)
print (df)
vote_1 vote_2 vote_3 vote_4 winner_columns
0 a a a b a
1 b b a b b
2 b a a b draw
The first problem is that your DataFrame sometimes contains a letter followed by a dot. So to look for solely 'a' or 'b', you have to replace these dots with an empty string, something like:
df.replace(r'\.', '', regex=True)
Another problem, which didn't surface in your case, is that a row can contain only 'a' or only 'b', and your code should be resistant to the absence of a particular result in such a source row. To make your function resistant to such cases, change it to:
def winner(row):
    vc = row.value_counts()
    a = vc.get('a', 0)
    b = vc.get('b', 0)
    if a > b:
        return 'a'
    elif a < b:
        return 'b'
    else:
        return 'draw'
Then you can apply your function, but if you want to apply it to each
row (not column), you should pass axis=1.
So, to sum up, change your code to:
df['winner_columns'] = df.replace(r'\.', '', regex=True).apply(winner, axis=1)
The result, for your sample data, is:
vote_1 vote_2 vote_3 vote_4 winner_columns
0 a. a. a. b a
1 b. b. a b b
2 b. a. a b draw
You can use .sum() to count the votes, save the winners in a list, and finally add the list to the DataFrame.
numpy_votes = dataframe_votes.to_numpy()
winner_columns = []
for i in numpy_votes:
    if np.sum(i == 'a') < np.sum(i == 'b'):
        winner_columns.append('b')
    elif np.sum(i == 'a') > np.sum(i == 'b'):
        winner_columns.append('a')
    else:
        winner_columns.append('draw')
dataframe_votes['winner_columns'] = winner_columns
Using the .sum() method is the fastest way to count elements inside arrays, according to this answer.
Output:
vote_1 vote_2 vote_3 vote_4 winner_columns
0 a a a b a
1 b b a b b
2 b a a b draw
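The explicit Python loop over rows can also be dropped entirely by comparing the whole array at once and summing along axis=1; a sketch on the question's sample data (np.select replaces the if/elif chain):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'vote_1': ['a', 'b', 'b'],
                   'vote_2': ['a', 'b', 'a'],
                   'vote_3': ['a', 'a', 'a'],
                   'vote_4': ['b', 'b', 'b']})

arr = df.to_numpy()
a = (arr == 'a').sum(axis=1)   # per-row count of 'a' votes
b = (arr == 'b').sum(axis=1)   # per-row count of 'b' votes
df['winner_columns'] = np.select([a > b, a < b], ['a', 'b'], default='draw')
print(df['winner_columns'].tolist())
# ['a', 'b', 'draw']
```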
I have a dataframe with two levels of columns index.
Reproducible Dataset.
df = pd.DataFrame(
    [['Gaz', 'Gaz', 'Gaz', 'Gaz'],
     ['X', 'X', 'X', 'X'],
     ['Y', 'Y', 'Y', 'Y'],
     ['Z', 'Z', 'Z', 'Z']],
    columns=pd.MultiIndex.from_arrays([['A', 'A', 'C', 'D'],
                                       ['Name', 'Name', 'Company', 'Company']]))
I want to rename the duplicated MultiIndex columns, but only when the combination of level 0 and level 1 is duplicated, adding a suffix number to the end, like the one below.
Below is a solution I found, but it only works for single level column index.
class renamer():
    def __init__(self):
        self.d = dict()

    def __call__(self, x):
        if x not in self.d:
            self.d[x] = 0
            return x
        else:
            self.d[x] += 1
            return "%s_%d" % (x, self.d[x])

df = df.rename(columns=renamer())
I think the above method can be modified to support the multi level situation, but I am too new to pandas/python.
Thanks in advance.
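For reference, the counter idea from the renamer class above can be applied to the whole column tuples rather than to single labels; a sketch (the helper name dedup_multiindex is made up, and only exact level-0/level-1 duplicates get a suffix):

```python
import pandas as pd

# Hypothetical helper: apply the renamer's counter idea to whole
# (level-0, level-1) tuples instead of individual labels.
def dedup_multiindex(columns):
    seen = {}
    new = []
    for tup in columns:                  # each tup is one column's full tuple
        if tup not in seen:
            seen[tup] = 0
            new.append(tup)
        else:
            seen[tup] += 1               # duplicate: suffix every level
            new.append(tuple('%s_%d' % (x, seen[tup]) for x in tup))
    return pd.MultiIndex.from_tuples(new)

cols = pd.MultiIndex.from_arrays([['A', 'A', 'C', 'A'],
                                  ['Name', 'Name', 'Company', 'Company']])
print(list(dedup_multiindex(cols)))
# [('A', 'Name'), ('A_1', 'Name_1'), ('C', 'Company'), ('A', 'Company')]
```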
@Datanovice
This is to clarify to you about the output what I need.
I have the snippet below.
import pandas as pd
import numpy as np
df = pd.DataFrame(
[ ['Gaz','Gaz','Gaz','Gaz'],
['X','X','X','X'],
['Y','Y','Y','Y'],
['Z','Z','Z','Z']],
columns=pd.MultiIndex.from_arrays([
['A','A','C','A'],
['A','A','C','A'],
['Company','Company','Company','Name']]))
s = pd.DataFrame(df.columns.tolist())
cond = s.groupby(0).cumcount()
s = [np.where(cond.gt(0), s[i] + '_' + cond.astype(str), s[i])
     for i in range(df.columns.nlevels)]
s = pd.DataFrame(s)
#print(s)
df.columns = pd.MultiIndex.from_arrays(s.values.tolist())
print(df)
The current result is not yet what I need: the last piece of the column index should not be counted as duplicated, as "A-A-Name" is not the same as the first two.
Thank you again.
There might be a better way to do this, but you could build a DataFrame from your columns, apply a conditional operation to it, and re-assign the result.
df = pd.DataFrame(
    [['Gaz', 'Gaz', 'Gaz', 'Gaz'],
     ['X', 'X', 'X', 'X'],
     ['Y', 'Y', 'Y', 'Y'],
     ['Z', 'Z', 'Z', 'Z']],
    columns=pd.MultiIndex.from_arrays([['A', 'A', 'C', 'A'],
                                       ['Name', 'Name', 'Company', 'Company']]))
s = pd.DataFrame(df.columns.tolist())
cond = s.groupby([0,1]).cumcount()
s[0] = np.where(cond.gt(0),s[0] + '_' + cond.astype(str),s[0])
s[1] = np.where(cond.gt(0),s[1] + '_' + cond.astype(str),s[1])
df.columns = pd.MultiIndex.from_frame(s)
print(df)
0     A     A_1        C        A
1  Name  Name_1  Company  Company
0 Gaz Gaz Gaz Gaz
1 X X X X
2 Y Y Y Y
3 Z Z Z Z
Try this -
arrays = [['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B'],['A', 'A', 'A', 'B', 'C', 'C', 'D', 'D']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples)
df = pd.DataFrame(np.random.randn(3, 8), columns=index)
A B
A A A B C C D D
0 0 0 1 3 1 2 1 4
1 0 1 1 1 1 3 0 1
2 1 1 4 2 3 2 1 4
suffix = pd.DataFrame(df.columns)
suffix['count'] = suffix.groupby(0).cumcount()
suffix['new'] = [((i[0]+'_'+str(j)),(i[1]+'_'+str(j))) for i,j in zip(suffix[0],suffix['count'])]
new_index = pd.MultiIndex.from_tuples(list(suffix['new']))
df.columns = new_index
I want column 'x' to be the same as column 'b' when column 'a' equals 'b', but if 'a' does not equal 'b', then I want 'x' to be ('a' + 'b')/2:
filename = 'test.csv'
df=pd.read_csv(filename)
df['x'] = np.where(df['a'] = df['b'], df['x'] = df['b']
df['x'] = np.where(df['a'] != df['b'], (df['a'] + df['b']/2))
print(df.head(5))
I'm getting the error "keyword can't be an expression".
Create your own function and then just use the apply function to have it create your new column.
Example:
import pandas as pd
df = pd.read_csv('something.csv')
def funct(row):
    if row['a'] == row['b']:
        return row['b']
    else:
        return (row['a'] + row['b']) / 2

df['x'] = df.apply(funct, axis=1)
print(df)
output:
a b x
0 1 1 1.0
1 2 2 2.0
2 3 4 3.5
3 4 3 3.5
4 5 5 5.0
5 6 7 6.5
I think you are looking for:
df['x'] = np.where(df['a'] == df['b'], df['b'], (df['a'] + df['b'])/2)
If a == b (note double equals), then x column takes the value of b, else it takes the value (a + b)/2
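A quick check of that one-liner on made-up sample data:

```python
import numpy as np
import pandas as pd

# Toy data: rows 0 and 2 have a == b, row 1 does not.
df = pd.DataFrame({'a': [1, 2, 3], 'b': [1, 4, 3]})
df['x'] = np.where(df['a'] == df['b'], df['b'], (df['a'] + df['b']) / 2)
print(df['x'].tolist())
# [1.0, 3.0, 3.0]
```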
I have the following Pandas dataframe:
A B C
A A Test1
A A Test2
A A XYZ
A B BA
A B AB
B A AA
I want to group this dataset twice: first by A and B to concatenate the values in C, and afterwards only by A to get the groups defined solely by column A. The result looks like this:
A A Test1,Test2,XYZ
A B AB, BA
B A AA
And the final result should be:
A A,A:(Test1,Test2,XYZ), A,B:(AB, BA)
B B,A:(AA)
Concatenating itself works; however, the sorting does not seem to work.
Can anyone help me with this problem?
Kind regards.
Using groupby + join
s1=df.groupby(['A','B']).C.apply(','.join)
s1
Out[421]:
A B
A A Test1,Test2,XYZ
B BA,AB
B A AA
Name: C, dtype: object
s1.reset_index().groupby('A').apply(lambda x : x.set_index(['A','B'])['C'].to_dict())
Out[420]:
A
A {('A', 'A'): 'Test1,Test2,XYZ', ('A', 'B'): 'B...
B {('B', 'A'): 'AA'}
dtype: object
First sort_values by the three columns, then groupby with join, then join the A and B columns, and finally groupby to build a dictionary per group:
df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(','.join).reset_index()
#if the DataFrame has only these 3 columns
#df1 = df.sort_values(list(df.columns)).groupby(['A','B'])['C'].apply(','.join).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
A B C D
0 A A Test1,Test2,XYZ A,A
1 A B AB,BA A,B
2 B A AA B,A
s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
A val
0 A {'A,A': 'Test1,Test2,XYZ', 'A,B': 'AB,BA'}
1 B {'B,A': 'AA'}
If need tuples only change first part of code:
df1 = df.sort_values(['A','B','C']).groupby(['A','B'])['C'].apply(tuple).reset_index()
df1['D'] = df1['A'] + ',' + df1['B']
print (df1)
A B C D
0 A A (Test1, Test2, XYZ) A,A
1 A B (AB, BA) A,B
2 B A (AA,) B,A
s = df1.groupby('A').apply(lambda x: dict(zip(x['D'], x['C']))).reset_index(name='val')
print (s)
A val
0 A {'A,A': ('Test1', 'Test2', 'XYZ'), 'A,B': ('AB...
1 B {'B,A': ('AA',)}
Recently, I have been converting from SAS to Python pandas. One question I have: does pandas have a retain-like function as in SAS, so that I can dynamically reference the last record? In the following code, I have to manually loop through each line and reference the previous record. It seems pretty slow compared to the similar SAS program. Is there any way to make it more efficient in pandas? Thank you.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 1, 1, 1], 'B': [0, 0, 1, 0]})
df['C'] = np.nan
df['lag_C'] = np.nan
for row in df.index:
    if row == df.head(1).index:
        df.loc[row, 'C'] = (df.loc[row, 'A'] == 0) + 0
    else:
        if df.loc[row, 'B'] == 1:
            df.loc[row, 'C'] = 1
        elif df.loc[row, 'lag_C'] == 0:
            df.loc[row, 'C'] = 0
        elif df.loc[row, 'lag_C'] != 0:
            df.loc[row, 'C'] = df.loc[row, 'lag_C'] + 1
    if row != df.tail(1).index:
        df.loc[row + 1, 'lag_C'] = df.loc[row, 'C']
A very complicated algorithm, but I tried a vectorized approach.
If I understand it correctly, a cumulative sum can be used, as in this question. The last column lag_C is the shifted column C.
But my approach cannot be used directly for the first rows of df, because those rows are determined only by the first value of column A and sometimes column B. So I created a column D that marks these rows; its values are later copied to the output column C when the conditions hold.
I changed the input data to test the problematic first rows. I tested all three possibilities of the first 3 rows of column B combined with the first row of column A.
My input conditions are:
Columns A and B contain only 1 or 0. Columns C and lag_C are helper columns containing only NaN.
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1,1,1,1,1,0,0,1,1,0,0], 'B': [0,0,1,1,0,0,0,1,0,1,0]})
df1 = pd.DataFrame({'A': [1,1,1,1,1,0,0,1,1,0,0], 'B': [0,0,1,1,0,0,0,1,0,1,0]})
#cumulative sum of column B
df1['C'] = df1['B'].cumsum()
df1['lag_C'] = 1
#the first 'group' with the min value is problematic; copy to column D for later use
df1.loc[df1['C'] == df1['C'].min(), 'D'] = df1['B']
#cumulative sums of groups to column C
df1['C'] = df1.groupby(['C'])['lag_C'].cumsum()
#correct problematic states in column C, use value from D
if df1['A'].loc[0] == 1:
    df1.loc[df1['D'].notnull(), 'C'] = df1['D']
if (df1['A'].loc[0] == 1) & (df1['B'].loc[0] == 1):
    df1.loc[df1['D'].notnull(), 'C'] = 0
del df1['D']
#shifted column lag_C from column C
df1['lag_C'] = df1['C'].shift(1)
print(df1)
# A B C lag_C
#0 1 0 0 NaN
#1 1 0 0 0
#2 1 1 1 0
#3 1 1 1 1
#4 1 0 2 1
#5 0 0 3 2
#6 0 0 4 3
#7 1 1 1 4
#8 1 0 2 1
#9 0 1 1 2
#10 0 0 2 1
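With current pandas, the counter logic can be sketched more directly with a cumulative-sum grouping. This is my reading of the rules above (C restarts at 1 on every B == 1, counts up otherwise, and stays 0 before the first 1 when A starts with 1), so verify it against the SAS output:

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0],
                   'B': [0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0]})

g = df['B'].cumsum()                     # a new group starts at every B == 1
df['C'] = df.groupby(g).cumcount() + 1   # 1-based counter within each group
if df.loc[0, 'A'] == 1:
    df.loc[g.eq(0), 'C'] = 0             # rows before the first B == 1 stay 0
df['lag_C'] = df['C'].shift(1)
print(df['C'].tolist())
# [0, 0, 1, 1, 2, 3, 4, 1, 2, 1, 2]
```

This reproduces the C column shown above without the helper column D.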