Change 1st row of a dataframe based on a condition in pandas - python

I have two columns, and based on their values I want to update a third column for only one row.
I have-
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 4, 4],
                   'B': [2, 2, 4, 3, 2, 1],
                   'C': [0] * 6})
print (df)
A B C
0 1 2 0
1 1 2 0
2 2 4 0
3 3 3 0
4 4 2 0
5 4 1 0
If A = 1 and B = 2, then only the first matching row should get C = 1, like this -
print (df)
A B C
0 1 2 1
1 1 2 0
2 2 4 0
3 3 3 0
4 4 2 0
5 4 1 0
Right now I have used
df.loc[(df['A']==1) & (df['B']==2)].iloc[[0]].loc['C'] = 1
but it doesn't change the dataframe.

Solution if the mask always matches at least one row:
Create a boolean mask and set C at the first True index value, found by idxmax:
mask = (df['A']==1) & (df['B']==2)
df.loc[mask.idxmax(), 'C'] = 1
But if nothing matches, idxmax returns the index of the first False value, so add an if-else guard:
import numpy as np
mask = (df['A']==1) & (df['B']==2)
# an all-False indexer selects no rows, so the assignment becomes a no-op
idx = mask.idxmax() if mask.any() else np.repeat(False, len(df))
df.loc[idx, 'C'] = 1
print (df)
A B C
0 1 2 1
1 1 2 0
2 2 4 0
3 3 3 0
4 4 2 0
5 4 1 0
mask = (df['A']==10) & (df['B']==20)
idx = mask.idxmax() if mask.any() else np.repeat(False, len(df))
df.loc[idx, 'C'] = 1
print (df)
A B C
0 1 2 0
1 1 2 0
2 2 4 0
3 3 3 0
4 4 2 0
5 4 1 0
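An equivalent way to express "the first matching index, or nothing" is plain index lookup; a minimal sketch of that variant:
mask = (df['A']==1) & (df['B']==2)
hits = df.index[mask]         # index labels of all matching rows
if len(hits):
    df.loc[hits[0], 'C'] = 1  # update only the first match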

Using pd.Series.cumsum to ensure only the first row matching the criteria is updated:
mask = df['A'].eq(1) & df['B'].eq(2)
df.loc[mask & mask.cumsum().eq(1), 'C'] = 1
print(df)
A B C
0 1 2 1
1 1 2 0
2 2 4 0
3 3 3 0
4 4 2 0
5 4 1 0
If performance is a concern, see Efficiently return the index of the first value satisfying condition in array.
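In that spirit, a minimal sketch that locates the first hit positionally with numpy (an assumption-laden variant, not the linked answer itself):
import numpy as np
mask = df['A'].eq(1) & df['B'].eq(2)
pos = np.flatnonzero(mask.to_numpy())  # positions of all matching rows
if pos.size:
    df.iloc[pos[0], df.columns.get_loc('C')] = 1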

Related

Using previous row value while creating a new column

I have a df in python that looks something like this:
'A'
0
1
0
0
1
1
1
1
0
I want to create another column that adds cumulative 1's from column A, and starts over if the value in column A becomes 0 again. So desired output:
'A' 'B'
0 0
1 1
0 0
0 0
1 1
1 2
1 3
1 4
0 0
This is what I am trying, but it's just replicating column A:
df.B[df.A ==0] = 0
df.B[df.A !=0] = df.A + df.B.shift(1)
Let us do cumsum to build the groups, then groupby with cumcount:
df['B'] = df.groupby(df.A.eq(0).cumsum()).cumcount().where(df.A == 1, 0)
Out[81]:
0 0
1 1
2 0
3 0
4 1
5 2
6 3
7 4
8 0
dtype: int64
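Detail - the grouping key restarts at every zero in A:
df.A.eq(0).cumsum()
# -> 1, 1, 2, 3, 3, 3, 3, 3, 4  for A = 0, 1, 0, 0, 1, 1, 1, 1, 0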
Use shift with ne and groupby.cumsum:
df['B'] = df.groupby(df['A'].shift().ne(df['A']).cumsum())['A'].cumsum()
print(df)
A B
0 0 0
1 1 1
2 0 0
3 0 0
4 1 1
5 1 2
6 1 3
7 1 4
8 0 0
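Detail - the key labels each consecutive run of equal values, so the cumulative sum restarts per run:
key = df['A'].shift().ne(df['A']).cumsum()
# -> 1, 2, 3, 3, 4, 4, 4, 4, 5  for A = 0, 1, 0, 0, 1, 1, 1, 1, 0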

How to apply cummulative count on multiple columns of dataframe

Dataframe
a b c
0 0 1 1
1 0 1 1
2 0 0 1
3 0 0 1
4 1 1 0
5 1 1 1
6 1 1 1
7 0 0 1
I am trying to apply a cumulative count (cumcount) across multiple columns of a dataframe; I have tried applying the cumulative count by grouping each column. Is there an easy way to achieve the expected output?
I have tried this code, but it is not working:
li = []
for column in df.columns:
    li.append(df.groupby(column)[column].cumcount())
pd.concat(li, axis=1)
Expected output
a b c
0 1 1 1
1 1 2 2
2 1 1 3
3 1 1 4
4 1 1 1
5 2 2 1
6 3 3 2
7 1 1 3
Create consecutive groups by comparing with shifted values, apply cumcount per column, and finally set 1 where the original value is 0 using a boolean mask:
df = (df.ne(df.shift()).cumsum()
        .apply(lambda x: df.groupby(x).cumcount() + 1)
        .mask(df == 0, 1))
print (df)
a b c
0 1 1 1
1 1 2 2
2 1 1 3
3 1 1 4
4 1 1 1
5 2 2 1
6 3 3 2
7 1 1 3
Another solution if performance is important - count only the 1 values, then set everything else to 1 with np.where:
import numpy as np
a = df == 1
b = a.cumsum()
# within each run of 1s: running total minus the total just before the run started
arr = np.where(a, b - b.mask(a).ffill().fillna(0).astype(int), 1)
df = pd.DataFrame(arr, index=df.index, columns=df.columns)
print (df)
a b c
0 1 1 1
1 1 2 2
2 1 1 3
3 1 1 4
4 1 1 1
5 2 2 1
6 3 3 2
7 1 1 3
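If you prefer to keep the per-column loop from the question, a minimal sketch of a working variant (assuming the original df) groups each column by its runs of consecutive equal values instead of by its raw values:
li = []
for column in df.columns:
    runs = df[column].ne(df[column].shift()).cumsum()   # label consecutive runs
    counts = df.groupby(runs)[column].cumcount().add(1)
    li.append(counts.mask(df[column].eq(0), 1).rename(column))
df = pd.concat(li, axis=1)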

Find first row with condition after each row satisfying another condition

In pandas I have the following data frame:
a b
0 0
1 1
2 1
0 0
1 0
2 1
Now I want to do the following:
Create a new column c, and for each row where a = 0 fill c with 1. Then c should stay 1 up to and including the first following row where b = 1 (and here I'm stuck), so the output should look like this:
a b c
0 0 1
1 1 1
2 1 0
0 0 1
1 0 1
2 1 1
Thanks!
It seems you need:
df['c'] = df.groupby(df.a.eq(0).cumsum())['b'].cumsum().le(1).astype(int)
print (df)
a b c
0 0 0 1
1 1 1 1
2 2 1 0
3 0 0 1
4 1 0 1
5 2 1 1
Detail:
print (df.a.eq(0).cumsum())
0 1
1 1
2 1
3 2
4 2
5 2
Name: a, dtype: int32
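The cumulative sum of b within those groups, compared with le(1), then gives c:
df.groupby(df.a.eq(0).cumsum())['b'].cumsum()
# -> 0, 1, 2, 0, 0, 1   and .le(1) -> 1, 1, 0, 1, 1, 1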

return rows with unique pairs across columns

I'm trying to find rows that have unique pairs of values across 2 columns, so this dataframe:
A B
1 0
2 0
3 0
0 1
2 1
3 1
0 2
1 2
3 2
0 3
1 3
2 3
will be reduced to only the rows that don't match up when flipped. For instance, (1, 3) is a combination I only want returned once; if the same pair exists with the columns flipped, (3, 1), it can be removed. The table I'm looking to get is:
A B
0 2
0 3
1 0
1 2
1 3
2 3
Where there is only one occurrence of each pair of values that are mirrored if the columns are flipped.
I think you can use apply with sorted + drop_duplicates (on recent pandas, return a Series from the lambda so apply keeps the DataFrame shape):
df = df.apply(lambda x: pd.Series(sorted(x), index=x.index), axis=1).drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Faster solution with numpy.sort:
df = pd.DataFrame(np.sort(df.values, axis=1), index=df.index,
                  columns=df.columns).drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Solution without sorting with DataFrame.min and DataFrame.max:
a = df.min(axis=1)
b = df.max(axis=1)
df['A'] = a
df['B'] = b
df = df.drop_duplicates()
print (df)
A B
0 0 1
1 0 2
2 0 3
4 1 2
5 1 3
8 2 3
Loading the data:
import numpy as np
import pandas as pd
a = np.array("1 2 3 0 2 3 0 1 3 0 1 2".split(), dtype=np.double)
b = np.array("0 0 0 1 1 1 2 2 2 3 3 3".split(), dtype=np.double)
df = pd.DataFrame(dict(A=a, B=b))
In case you don't need to sort the entire DF:
df["trans"] = df.apply(
lambda row: (min(row['A'], row['B']), max(row['A'], row['B'])), axis=1
)
df.drop_duplicates("trans")
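If the helper column should not remain in the result, it can be dropped afterwards; a small usage sketch:
df = df.drop_duplicates("trans").drop(columns="trans")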

Problems with pandas and numpy where condition/multiple values?

I have the following pandas dataframe:
A B
1 3
0 3
1 2
0 1
0 0
1 4
....
0 0
I would like to add a new column on the right side, following this condition:
if the value in B is 3 or 2, put 1 in new_col, for instance:
(*)
A B new_col
1 3 1
0 3 1
1 2 1
0 1 0
0 0 0
1 4 0
....
0 0 0
So I tried the following:
df['new_col'] = np.where(df['B'] == 3 & 2,'1','0')
However, it did not work:
A B new_col
1 3 0
0 3 0
1 2 1
0 1 0
0 0 0
1 4 0
....
0 0 0
Any idea how to do a multiple-condition statement with pandas and numpy like (*)?
You can use pandas isin, which returns a boolean showing whether the elements you're looking for are contained in column 'B'. (The original attempt fails because 3 & 2 evaluates to 2 first, so df['B'] == 3 & 2 only tests df['B'] == 2.)
df['new_col'] = df['B'].isin([3, 2])
A B new_col
0 1 3 True
1 0 3 True
2 1 2 True
3 0 1 False
4 0 0 False
5 1 4 False
Then, you can use astype to convert the boolean values to 0 and 1, True being 1 and False being 0
df['new_col'] = df['B'].isin([3, 2]).astype(int)
Output:
A B new_col
0 1 3 1
1 0 3 1
2 1 2 1
3 0 1 0
4 0 0 0
5 1 4 0
Using numpy:
>>> df['new_col'] = np.where(np.logical_or(df['B'] == 3, df['B'] == 2), '1','0')
>>> df
A B new_col
0 1 3 1
1 0 3 1
2 1 2 1
3 0 1 0
4 0 0 0
5 1 4 0
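The two approaches combine naturally; a small sketch passing the isin mask straight to np.where:
df['new_col'] = np.where(df['B'].isin([2, 3]), 1, 0)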
df['new_col'] = [1 if x in [2, 3] else 0 for x in df.B]
The operators *, +, ^ work on booleans as expected, and mixing booleans with integers gives the expected result. So you can also do:
df['new_col'] = [(x in [2, 3]) * 1 for x in df.B]
Using numpy broadcasting:
df['new'] = (df.B.values[:, None] == np.array([2, 3])).any(1) * 1
Timing (plots omitted): benchmarks over the given data set and over 60,000 rows.
df = pd.DataFrame({'A': [1, 0, 1, 0, 0, 1], 'B': [3, 3, 2, 1, 0, 4]})
print(df)
df['C'] = [1 if vals == 2 or vals == 3 else 0 for vals in df['B']]
print(df)
A B
0 1 3
1 0 3
2 1 2
3 0 1
4 0 0
5 1 4
A B C
0 1 3 1
1 0 3 1
2 1 2 1
3 0 1 0
4 0 0 0
5 1 4 0
