Pandas: add column with progressive count of elements meeting a condition

Given the following dataframe df:
import pandas as pd
import numpy as np

df = pd.DataFrame({'A':['Tony', 'Mike', 'Jen', 'Anna'], 'B': ['no', 'yes', 'no', 'yes']})

      A    B
0  Tony   no
1  Mike  yes
2   Jen   no
3  Anna  yes
I want to add another column that counts, progressively, the elements where df['B']=='yes':
      A    B  C
0  Tony   no  0
1  Mike  yes  1
2   Jen   no  0
3  Anna  yes  2
How can I do this?

You can use numpy.where with the cumsum of a boolean mask:
m = df['B']=='yes'
df['C'] = np.where(m, m.cumsum(), 0)
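To see why this works, here are the intermediate values for the sample frame (shown as comments):
m = df['B'] == 'yes'        # [False, True, False, True]
m.cumsum()                  # [0, 1, 1, 2] - the count advances on every 'yes'
np.where(m, m.cumsum(), 0)  # [0, 1, 0, 2] - keep the count only where m is True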
Another solution is to take the cumulative sum of the filtered boolean mask and then restore the 0 values with reindex:
m = df['B']=='yes'
df['C'] = m[m].cumsum().reindex(df.index, fill_value=0)
print(df)
      A    B  C
0  Tony   no  0
1  Mike  yes  1
2   Jen   no  0
3  Anna  yes  2
Performance (results on real data may differ, so it is best to benchmark on your own data first):
np.random.seed(123)
N = 10000
L = ['yes','no']
df = pd.DataFrame({'B': np.random.choice(L, N)})
print(df)
In [150]: %%timeit
...: m = df['B']=='yes'
...: df['C'] = np.where(m, m.cumsum(), 0)
...:
1.57 ms ± 34.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [151]: %%timeit
...: m = df['B']=='yes'
...: df['C'] = m[m].cumsum().reindex(df.index, fill_value=0)
...:
2.53 ms ± 54.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [152]: %%timeit
...: df['C'] = df.groupby('B').cumcount() + 1
...: df['C'].where(df['B'] == 'yes', 0, inplace=True)
4.49 ms ± 27.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

You can use GroupBy + cumcount followed by pd.Series.where:
df['C'] = df.groupby('B').cumcount() + 1
df['C'].where(df['B'] == 'yes', 0, inplace=True)
print(df)
      A    B  C
0  Tony   no  0
1  Mike  yes  1
2   Jen   no  0
3  Anna  yes  2
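For reference, the intermediate values on the sample frame (as comments), assuming the same df as above:
df.groupby('B').cumcount()      # [0, 0, 1, 1] - running count within each B group
df.groupby('B').cumcount() + 1  # [1, 1, 2, 2]
# where() then keeps these values only on 'yes' rows and replaces the rest with 0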

Related

Loop pandas data frame

I have the below data frame and want to loop over it:
df =
name
a
b
c
d
I have tried the below code:
for index, row in df.iterrows():
    for line in df['name']:
        print(index, line)
but the result I want is a dataframe like the one below:
df =
name name1
a    a
a    b
a    c
a    d
b    a
b    b
b    c
b    d
etc.
Is there any possible way to do it? I know it's a basic question, but I'm new to Python.
One way using pandas.DataFrame.explode:
df["name1"] = [df["name"] for _ in df["name"]]
df.explode("name1")
Output:
  name name1
0    a     a
0    a     b
0    a     c
0    a     d
1    b     a
1    b     b
1    b     c
1    b     d
2    c     a
2    c     b
2    c     c
2    c     d
3    d     a
3    d     b
3    d     c
3    d     d
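Note that explode keeps the original index (0, 0, 0, 0, 1, ...). If a fresh RangeIndex is wanted, a small usage sketch:
out = df.explode("name1").reset_index(drop=True)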
The fastest solution is in numpy, thanks to @Ch3steR:
df = pd.DataFrame({'name': np.repeat(df['name'], len(df)),
                   'name1': np.tile(df['name'], len(df))})
Use itertools.product with DataFrame constructor:
from itertools import product
df = pd.DataFrame(product(df['name'], df['name']), columns=['name','name1'])
# for older pandas versions
#df = pd.DataFrame(list(product(df['name'], df['name'])), columns=['name','name1'])
print(df)
   name name1
0     a     a
1     a     b
2     a     c
3     a     d
4     b     a
5     b     b
6     b     c
7     b     d
8     c     a
9     c     b
10    c     c
11    c     d
12    d     a
13    d     b
14    d     c
15    d     d
Another idea is to use a cross join, the best solution if performance is important:
df1 = df.assign(new=1)
df = df1.merge(df1, on='new', suffixes=('','1')).drop('new', axis=1)
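In pandas 1.2 and later, merge supports how='cross' directly, so the helper column is not needed; a minimal sketch of the same cross join:
df = df.merge(df, how='cross', suffixes=('', '1'))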
Performance:
from itertools import product
df = pd.DataFrame({'name':range(1000)})
# print (df)
In [17]: %%timeit
...: df["name1"] = [df["name"] for _ in df["name"]]
...: df.explode("name1")
...:
...:
18.9 s ± 1.18 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [18]: %%timeit
...: pd.DataFrame(product(df['name'], df['name']), columns=['name','name1'])
...:
1.01 s ± 62.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [19]: %%timeit
...: df1 = df.assign(new=1)
...: df1.merge(df1, on='new', suffixes=('','1')).drop('new', axis=1)
...:
...:
245 ms ± 21.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [20]: %%timeit
...: pd.DataFrame({'name':np.repeat(df['name'],len(df)), 'name1':np.tile(df['name'],len(df))})
...:
30.2 ms ± 1.43 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

feature crossing in pandas

I have 2 columns in a pandas DataFrame:
col_A  col_B
0      1
0      0
0      1
0      1
1      0
1      0
1      1
I want to create a new column for each combination of values of col_A and col_B, similar to get_dummies(), except here I am using a combination of columns. Example output - for this column the value of col_A is 0 and col_B is 1:
col_A_0_col_B_1
1
0
1
1
0
0
0
I am currently using iterrows() to iterate through every row, check the values, and then set the new column. Is there a shorter, more idiomatic pandas approach to achieve this?
Convert chained boolean masks to integers:
df['col_A_0_col_B_1'] = ((df['col_A']==0)&(df['col_B']==1)).astype(int)
For better performance:
df['col_A_0_col_B_1'] = ((df['col_A'].values==0)&(df['col_B'].values==1)).astype(int)
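If you need a column for every combination at once, closer to what get_dummies() produces, one hedged sketch is to concatenate the columns as strings first; note the generated names come out as col_A_col_B_0_1 rather than the OP's exact format:
combo = df['col_A'].astype(str) + '_' + df['col_B'].astype(str)
dummies = pd.get_dummies(combo, prefix='col_A_col_B')  # one column per combination
df = df.join(dummies)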
Performance depends on the number of rows and on the distribution of 0 and 1 values:
np.random.seed(343)
#10k rows
df = pd.DataFrame(np.random.choice([0,1], size=(10000, 2)), columns=['col_A','col_B'])
#print (df)
In [92]: %%timeit
...: df['col_A_0_col_B_1'] = ((df['col_A']==0)&(df['col_B']==1)).astype(int)
...:
870 µs ± 44.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [93]: %%timeit
...: df['col_A_0_col_B_1'] = ((df['col_A'].values==0)&(df['col_B'].values==1)).astype(int)
...:
201 µs ± 3.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [94]: %%timeit
...: df['col_A_0_col_B_1'] = pd.Series((df.col_A == 0) & (df.col_B == 1), dtype='uint')
...:
833 µs ± 12.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [95]: %%timeit
...: df['col_A_0_col_B_1'] = np.where((df['col_A']==0)&(df['col_B']==1), 1, 0)
...:
956 µs ± 242 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [96]: %%timeit
...: df['col_A_0_col_B_1'] = pd.Series([a == 0 and b == 1 for a, b in zip(df.col_A, df.col_B)], dtype='uint')
...:
1.61 ms ± 57.2 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [97]: %%timeit
...: df['col_A_0_col_B_1'] = 0
...: df.loc[(df.col_A == 0) & (df.col_B==1),'col_A_0_col_B_1'] = 1
...:
3.07 ms ± 68.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
You can use np.where:
df['col_A_0_col_B_1'] = np.where((df['col_A']==0)&(df['col_B']==1), 1, 0)
First create your column and assign it a default, e.g. 0 for False:
df['col_A_0_col_B_1'] = 0
Then, using loc, you can filter where col_A == 0 and col_B == 1 and assign 1 to the new column:
df.loc[(df.col_A == 0) & (df.col_B==1),'col_A_0_col_B_1'] = 1
If I understood correctly, you could do something like this:
import pandas as pd
data = [[0, 1],
        [0, 0],
        [0, 1],
        [0, 1],
        [1, 0],
        [1, 0],
        [1, 1]]
df = pd.DataFrame(data=data, columns=['col_A', 'col_B'])
df['col_A_0_col_B_1'] = pd.Series([a == 0 and b == 1 for a, b in zip(df.col_A, df.col_B)], dtype='uint')
print(df)
Output
   col_A  col_B  col_A_0_col_B_1
0      0      1                1
1      0      0                0
2      0      1                1
3      0      1                1
4      1      0                0
5      1      0                0
6      1      1                0
Or as an alternative:
df = pd.DataFrame(data=data, columns=['col_A', 'col_B'])
df['col_A_0_col_B_1'] = pd.Series((df.col_A == 0) & (df.col_B == 1), dtype='uint')
print(df)
You can use pandas' ~ for boolean not, relying on 1 and 0 behaving as True and False:
df['col_A_0_col_B_1'] = ~df['col_A'] & df['col_B']
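The bitwise trick above assumes the columns hold only 0 and 1 (for example, ~1 is -2, which only works out because -2 & 1 == 0). A sketch of a more explicit variant that casts to bool first and returns 0/1 integers:
# assumption: truthiness semantics, i.e. any nonzero value counts as True
df['col_A_0_col_B_1'] = (~df['col_A'].astype(bool) & df['col_B'].astype(bool)).astype(int)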

How to map a new variable in pandas in an effective way

Here's my data
Id  Amount
1   6
2   2
3   0
4   6
What I need is a mapping: if Amount is 3 or more, Map is 1; if Amount is less than 3, Map is 0.
Id  Amount  Map
1   6       1
2   2       0
3   0       0
4   6       1
What I did:
a = df[['Id','Amount']]
a = a[a['Amount'] >= 3]
a['Map'] = 1
a = a[['Id', 'Map']]
df = df.merge(a, on='Id', how='left')
df['Map'] = df['Map'].fillna(0)
It works, but it is neither very configurable nor efficient.
Convert boolean mask to integer:
#for better performance convert to numpy array
df['Map'] = (df['Amount'].values >= 3).astype(int)
#pure pandas solution
df['Map'] = (df['Amount'] >= 3).astype(int)
print(df)
   Id  Amount  Map
0   1       6    1
1   2       2    0
2   3       0    0
3   4       6    1
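Since the OP asked for something configurable, np.select generalizes naturally to several thresholds; a sketch with illustrative cutoffs:
conditions = [df['Amount'] >= 3]   # add more conditions here as needed
choices = [1]                      # value for each condition, in order
df['Map'] = np.select(conditions, choices, default=0)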
Performance:
#[400000 rows x 3 columns]
df = pd.concat([df] * 100000, ignore_index=True)
In [133]: %timeit df['Map'] = (df['Amount'].values >= 3).astype(int)
2.44 ms ± 97.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [134]: %timeit df['Map'] = (df['Amount'] >= 3).astype(int)
2.6 ms ± 66.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Efficient and fastest way in Pandas to create sorted list from column values

Given a dataframe
A  B  C
3  1  2
2  1  3
3  2  1
I would like to get a new column with the column names in sorted order:
A  B  C  new_col
3  1  2  [B, C, A]
2  1  3  [B, A, C]
3  2  1  [C, B, A]
This is my code. It works but is quite slow.
import operator

col_list = df.columns  # the columns to sort by value

def blist(x):
    col_dict = {}
    for col in col_list:
        col_dict[col] = x[col]
    sorted_tuple = sorted(col_dict.items(), key=operator.itemgetter(1))
    return [i[0] for i in sorted_tuple]

df['new_col'] = df.apply(blist, axis=1)
I will appreciate a better approach to solve this problem.
Try to use np.argsort() in conjunction with np.take():
In [132]: df['new_col'] = np.take(df.columns, np.argsort(df)).tolist()
In [133]: df
Out[133]:
A B C new_col
0 3 1 2 [B, C, A]
1 2 1 3 [B, A, C]
2 3 2 1 [C, B, A]
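To see what is happening on the underlying array of the original three columns, np.argsort returns the column positions in per-row sorted order, and np.take maps those positions back to the labels:
np.argsort(df[['A', 'B', 'C']].values)
# array([[1, 2, 0],
#        [1, 0, 2],
#        [2, 1, 0]])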
Timing for a 30,000-row DataFrame:
In [182]: df = pd.concat([df] * 10**4, ignore_index=True)
In [183]: df.shape
Out[183]: (30000, 3)
In [184]: %timeit df.apply(blist,axis=1)
4.84 s ± 31 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
In [185]: %timeit np.take(df.columns, np.argsort(df)).tolist()
5.45 ms ± 26.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Ratio:
In [187]: (4.84*1000)/5.45
Out[187]: 888.0733944954128

Pandas Dataframe Find Rows Where all Columns Equal

I have a dataframe that has characters in it - I want a boolean result by row that tells me if all columns for that row have the same value.
For example, I have
df =
   a  b  c  d
0  C  C  C  C
1  C  C  A  A
2  A  A  A  A
and I want the result to be
0 True
1 False
2 True
I've tried .all, but it seems I can only check whether all values equal one specific letter. The only other way I can think of is doing a unique on each row and seeing if that equals 1? Thanks in advance.
I think the cleanest way is to check all columns against the first column using eq:
In [11]: df
Out[11]:
a b c d
0 C C C C
1 C C A A
2 A A A A
In [12]: df.iloc[:, 0]
Out[12]:
0 C
1 C
2 A
Name: a, dtype: object
In [13]: df.eq(df.iloc[:, 0], axis=0)
Out[13]:
a b c d
0 True True True True
1 True True False False
2 True True True True
Now you can use all (if they are all equal to the first item, they are all equal):
In [14]: df.eq(df.iloc[:, 0], axis=0).all(1)
Out[14]:
0 True
1 False
2 True
dtype: bool
Compare the array against its first column and check whether all values per row are True. The same solution in numpy gives better performance:
a = df.values
b = (a == a[:, [0]]).all(axis=1)
print(b)
[ True False  True]
And if you need a Series:
s = pd.Series(b, index=df.index)
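The [0] in a[:, [0]] matters here: list indexing keeps the first column as a 2-D (n, 1) array, so the comparison broadcasts across every column of a. A small sketch:
a[:, 0].shape    # (3,)   - plain integer indexing drops to 1-D
a[:, [0]].shape  # (3, 1) - list indexing keeps 2-D for broadcasting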
Comparing solutions:
data = [[10,10,10],[12,12,12],[10,12,10]]
df = pd.DataFrame(data,columns=['Col1','Col2','Col3'])
#[30000 rows x 3 columns]
df = pd.concat([df] * 10000, ignore_index=True)
#jez - numpy array
In [14]: %%timeit
...: a = df.values
...: b = (a == a[:, [0]]).all(axis=1)
141 µs ± 3.23 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#jez - Series
In [15]: %%timeit
...: a = df.values
...: b = (a == a[:, [0]]).all(axis=1)
...: pd.Series(b, index=df.index)
169 µs ± 2.02 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
#Andy Hayden
In [16]: %%timeit
...: df.eq(df.iloc[:, 0], axis=0).all(axis=1)
2.22 ms ± 68.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
#Wen1
In [17]: %%timeit
...: list(map(lambda x : len(set(x))==1,df.values))
56.8 ms ± 1.04 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#K.-Michael Aye
In [18]: %%timeit
...: df.apply(lambda x: len(set(x)) == 1, axis=1)
686 ms ± 23.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
#Wen2
In [19]: %%timeit
...: df.nunique(1).eq(1)
2.87 s ± 115 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
nunique: new in version 0.20.0. (Based on the timing benchmark from jezrael; if performance is not important you can use this one.)
df.nunique(axis = 1).eq(1)
Out[308]:
0 True
1 False
2 True
dtype: bool
Or you can use map with set:
list(map(lambda x : len(set(x))==1,df.values))
df = pd.DataFrame.from_dict({'a': 'C C A'.split(),
                             'b': 'C C A'.split(),
                             'c': 'C A A'.split(),
                             'd': 'C A A'.split()})
df.apply(lambda x: len(set(x)) == 1, axis=1)
0 True
1 False
2 True
dtype: bool
Explanation: set(x) has only one element if all elements of the row are the same. The axis=1 option applies the function over rows instead of columns.
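For example, on the second row of the sample frame:
row = ['C', 'C', 'A', 'A']
set(row)             # {'C', 'A'}
len(set(row)) == 1   # False -> not all values in the row are equal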
You can use nunique(axis=1), so the result (added as a new column) can be obtained by:
df['unique'] = df.nunique(axis=1) == 1
The answer by @yo-and-ben-w uses eq(1), but I think == 1 is easier to read.
