I have the following pandas dataframe:
A B
1 3
0 3
1 2
0 1
0 0
1 4
....
0 0
I would like to add a new column on the right side, based on the following condition:
If the value in B is 3 or 2, put a 1 in new_col, for instance:
(*)
A B new_col
1 3 1
0 3 1
1 2 1
0 1 0
0 0 0
1 4 0
....
0 0 0
So I tried the following:
df['new_col'] = np.where(df['B'] == 3 & 2,'1','0')
However, it did not work:
A B new_col
1 3 0
0 3 0
1 2 1
0 1 0
0 0 0
1 4 0
....
0 0 0
Any idea how to write a multiple-condition statement with pandas and numpy that produces (*)?
You can use Pandas isin, which will return a boolean Series showing whether the elements you're looking for are contained in column 'B'.
df['new_col'] = df['B'].isin([3, 2])
A B new_col
0 1 3 True
1 0 3 True
2 1 2 True
3 0 1 False
4 0 0 False
5 1 4 False
Then, you can use astype to convert the boolean values to 0 and 1, True being 1 and False being 0:
df['new_col'] = df['B'].isin([3, 2]).astype(int)
Output:
A B new_col
0 1 3 1
1 0 3 1
2 1 2 1
3 0 1 0
4 0 0 0
5 1 4 0
Using numpy:
>>> df['new_col'] = np.where(np.logical_or(df['B'] == 3, df['B'] == 2), 1, 0)
>>> df
A B new_col
0 1 3 1
1 0 3 1
2 1 2 1
3 0 1 0
4 0 0 0
5 1 4 0
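As an aside, the original attempt fails because of operator precedence: & binds tighter than ==, so df['B'] == 3 & 2 is evaluated as df['B'] == (3 & 2), i.e. df['B'] == 2, which is exactly the output shown in the question. A minimal sketch of the direct fix, parenthesizing each comparison and combining them with the element-wise | operator:
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 1, 0, 0, 1], 'B': [3, 3, 2, 1, 0, 4]})

# & and | bind tighter than ==, so each comparison must be wrapped in parentheses.
df['new_col'] = np.where((df['B'] == 3) | (df['B'] == 2), 1, 0)
print(df)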
df['new_col'] = [1 if x in [2, 3] else 0 for x in df.B]
The operators * + ^ work on booleans as expected, and mixing them with integers gives the expected result. So you can also do:
df['new_col'] = [(x in [2, 3]) * 1 for x in df.B]
Using numpy:
df['new'] = (df.B.values[:, None] == np.array([2, 3])).any(1) * 1
Timing was compared over the given data set and over 60,000 rows (benchmark results omitted).
df = pd.DataFrame({'A': [1, 0, 1, 0, 0, 1], 'B': [3, 3, 2, 1, 0, 4]})
print(df)
df['C'] = [1 if val == 2 or val == 3 else 0 for val in df['B']]
print(df)
A B
0 1 3
1 0 3
2 1 2
3 0 1
4 0 0
5 1 4
A B C
0 1 3 1
1 0 3 1
2 1 2 1
3 0 1 0
4 0 0 0
5 1 4 0
Related
I have a df in python that looks something like this:
'A'
0
1
0
0
1
1
1
1
0
I want to create another column that counts cumulative 1s from column A, and starts over whenever the value in column A becomes 0 again. So the desired output is:
'A' 'B'
0 0
1 1
0 0
0 0
1 1
1 2
1 3
1 4
0 0
This is what I am trying, but it's just replicating column A:
df.B[df.A ==0] = 0
df.B[df.A !=0] = df.A + df.B.shift(1)
Let us do cumsum with groupby and cumcount:
df['B']=(df.groupby(df.A.eq(0).cumsum()).cumcount()).where(df.A==1,0)
Out[81]:
0 0
1 1
2 0
3 0
4 1
5 2
6 3
7 4
8 0
dtype: int64
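To see what the grouper is doing, here is a breakdown of the intermediate steps (a sketch, assuming the same df):
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0, 0, 1, 1, 1, 1, 0]})

# Each zero in A starts a new group; cumsum over the zero markers yields group labels.
groups = df.A.eq(0).cumsum()
print(groups.tolist())   # [1, 1, 2, 3, 3, 3, 3, 3, 4]

# cumcount numbers the rows within each group from 0, and where() zeroes out rows where A == 0.
df['B'] = df.groupby(groups).cumcount().where(df.A == 1, 0)
print(df['B'].tolist())  # [0, 1, 0, 0, 1, 2, 3, 4, 0]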
Use shift with ne and groupby.cumsum:
df['B'] = df.groupby(df['A'].shift().ne(df['A']).cumsum())['A'].cumsum()
print(df)
A B
0 0 0
1 1 1
2 0 0
3 0 0
4 1 1
5 1 2
6 1 3
7 1 4
8 0 0
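The key idea is that shift().ne(...).cumsum() labels each run of consecutive equal values, so the cumulative sum of A restarts inside every run; a quick look at the labels (same df assumed):
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0, 0, 1, 1, 1, 1, 0]})

# True wherever a new run of equal values begins; cumsum turns that into run labels.
runs = df['A'].shift().ne(df['A']).cumsum()
print(runs.tolist())  # [1, 2, 3, 3, 4, 4, 4, 4, 5]

# Summing A within each run restarts the counter after every 0.
print(df.groupby(runs)['A'].cumsum().tolist())  # [0, 1, 0, 0, 1, 2, 3, 4, 0]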
I have two columns, and based on their values I want to update a third column, for only one row.
I have:
df = pd.DataFrame({'A':[1,1,2,3,4,4],
'B':[2,2,4,3,2,1],
'C':[0] * 6})
print (df)
A B C
0 1 2 0
1 1 2 0
2 2 4 0
3 3 3 0
4 4 2 0
5 4 1 0
If A = 1 and B = 2, then only the first such row should get C = 1, like this:
print (df)
A B C
0 1 2 1
1 1 2 0
2 2 4 0
3 3 3 0
4 4 2 0
5 4 1 0
Right now I have used
df.loc[(df['A']==1) & (df['B']==2)].iloc[[0]].loc['C'] = 1
but it doesn't change the dataframe.
Solution if the mask always matches at least one row:
Create a boolean mask and set the value at the first True index, obtained with idxmax:
mask = (df['A']==1) & (df['B']==2)
df.loc[mask.idxmax(), 'C'] = 1
But if no value matches, idxmax returns the index of the first False value, so add an if-else guard:
mask = (df['A']==1) & (df['B']==2)
idx = mask.idxmax() if mask.any() else np.repeat(False, len(df))
df.loc[idx, 'C'] = 1
print (df)
A B C
0 1 2 1
1 1 2 0
2 2 4 0
3 3 3 0
4 4 2 0
5 4 1 0
With values that match no row, nothing changes:
mask = (df['A']==10) & (df['B']==20)
idx = mask.idxmax() if mask.any() else np.repeat(False, len(df))
df.loc[idx, 'C'] = 1
print (df)
A B C
0 1 2 0
1 1 2 0
2 2 4 0
3 3 3 0
4 4 2 0
5 4 1 0
Using pd.Series.cumsum to ensure only the first matching row is updated:
mask = df['A'].eq(1) & df['B'].eq(2)
df.loc[mask & mask.cumsum().eq(1), 'C'] = 1
print(df)
A B C
0 1 2 1
1 1 2 0
2 2 4 0
3 3 3 0
4 4 2 0
5 4 1 0
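To see why only the first matching row is updated, inspect the pieces of the mask (a sketch over the same df, before C is set):
import pandas as pd

df = pd.DataFrame({'A': [1, 1, 2, 3, 4, 4],
                   'B': [2, 2, 4, 3, 2, 1],
                   'C': [0] * 6})

mask = df['A'].eq(1) & df['B'].eq(2)
print(mask.tolist())                          # [True, True, False, False, False, False]

# cumsum counts the matches seen so far; eq(1) stays True from the first match
# until a second match appears, so ANDing with mask isolates the first match only.
print((mask & mask.cumsum().eq(1)).tolist())  # [True, False, False, False, False, False]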
If performance is a concern, see Efficiently return the index of the first value satisfying condition in array.
I have a DataFrame:
df.head()
Index Value
0 1.0,1.0,1.0,1.0
1 1.0,1.0
2 1.0,1.0
3 3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0
4 4
I'd like to count the occurrences of values in the Value column:
Index Value 1 2 3 4
0 1.0,1.0,1.0,1.0 4 0 0 0
1 1.0,1.0 2 0 0 0
2 1.0,1.0 2 0 0 0
3 3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0 0 0 6 2
4 4 0 0 0 1
I've done this before with string values using Counter, but I found that approach doesn't work when the values are floats?
df_counts = df['Value'].apply(lambda x: pd.Series(Counter(x.split(','))), 1).fillna(0).astype(int)
Map the values to floats, and finally convert the columns to integers:
from collections import Counter

df_counts = (df['Value'].apply(lambda x: pd.Series(Counter(map(float, x.split(',')))))
             .fillna(0)
             .astype(int)
             .rename(columns=int))
print (df_counts)
1 3 4
0 4 0 0
1 2 0 0
2 2 0 0
3 0 6 2
4 0 0 1
Finally, if necessary, add all missing categories with reindex and join back to the original:
cols = np.arange(df_counts.columns.min(), df_counts.columns.max() + 1)
df = df.join(df_counts.reindex(columns=cols, fill_value=0))
print (df)
Value 1 2 3 4
Index
0 1.0,1.0,1.0,1.0 4 0 0 0
1 1.0,1.0 2 0 0 0
2 1.0,1.0 2 0 0 0
3 3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0 0 0 6 2
4 4 0 0 0 1
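An alternative sketch that avoids Counter entirely, splitting the strings and counting per row with value_counts (assuming the same df):
import pandas as pd

df = pd.DataFrame({'Value': ['1.0,1.0,1.0,1.0', '1.0,1.0', '1.0,1.0',
                             '3.0,3.0,3.0,3.0,3.0,3.0,4.0,4.0', '4']})

# Split into one value per cell, stack to long form, cast to int, and count per original row.
counts = (df['Value'].str.split(',', expand=True)
                     .stack()
                     .astype(float)
                     .astype(int)
                     .groupby(level=0)
                     .value_counts()
                     .unstack(fill_value=0))
print(df.join(counts))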
I have a list, and I want to set the index of a dataframe to the Cartesian product of the list values with the dataframe's rows, i.e.
li = ['A','B']
df = pd.DataFrame([[0,0,0],[1,1,1],[2,2,2]])
I want the resulting dataframe to be like
0 1 2
A 0 0 0
A 1 1 1
A 2 2 2
B 0 0 0
B 1 1 1
B 2 2 2
How can I do this?
Option 1
pd.concat with keys argument
pd.concat([df] * len(li), keys=li)
0 1 2
A 0 0 0 0
1 1 1 1
2 2 2 2
B 0 0 0 0
1 1 1 1
2 2 2 2
To replicate your output exactly:
pd.concat([df] * len(li), keys=li).reset_index(1, drop=True)
0 1 2
A 0 0 0
A 1 1 1
A 2 2 2
B 0 0 0
B 1 1 1
B 2 2 2
Option 2
np.tile and np.repeat
pd.DataFrame(np.tile(df, [len(li), 1]), np.repeat(li, len(df)), df.columns)
0 1 2
A 0 0 0
A 1 1 1
A 2 2 2
B 0 0 0
B 1 1 1
B 2 2 2
Use MultiIndex.from_product with reindex:
mux = pd.MultiIndex.from_product([li, df.index])
df = df.reindex(mux, level=1).reset_index(level=1, drop=True)
print (df)
0 1 2
A 0 0 0
A 1 1 1
A 2 2 2
B 0 0 0
B 1 1 1
B 2 2 2
Or you can use set_index with stack:
li = [['A','B']]
df['New'] = li * len(df)
df.set_index([0,1,2])['New'].apply(pd.Series).stack().to_frame().rename(columns={0:'keys'})\
  .reset_index().drop('level_3', axis=1).sort_values('keys')
Out[698]:
0 1 2 keys
0 0 0 0 A
2 1 1 1 A
4 2 2 2 A
1 0 0 0 B
3 1 1 1 B
5 2 2 2 B
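A closely related variant not shown above: pd.concat also accepts a dict, whose keys become the outer index level, and droplevel then removes the original integer level (note that concat may sort dict keys, so pass keys= explicitly if order matters):
import pandas as pd

li = ['A', 'B']
df = pd.DataFrame([[0, 0, 0], [1, 1, 1], [2, 2, 2]])

# The dict keys become the outer level of the MultiIndex; droplevel(1) discards the 0..2 level.
out = pd.concat({k: df for k in li}).droplevel(1)
print(out)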
I know that if I have a DataFrame object in Pandas that I can find out if the row is a duplicate by using the .duplicated() method on the DataFrame. This will return a Series giving True or False depending on whether the row was a duplicate or not. My question is, is it then possible to index the original DataFrame with this object, such that I only return the duplicates (so that I can visually inspect them)?
In [18]: df = pd.DataFrame(np.random.randint(0, 2, (10, 4)))
In [19]: df
Out[19]:
0 1 2 3
0 0 1 1 0
1 0 1 1 1
2 0 1 1 1
3 1 1 0 0
4 0 1 0 1
5 1 0 1 0
6 0 1 0 1
7 1 1 1 0
8 0 1 1 0
9 0 0 0 1
[10 rows x 4 columns]
In [20]: df[df.duplicated()]
Out[20]:
0 1 2 3
2 0 1 1 1
6 0 1 0 1
8 0 1 1 0
[3 rows x 4 columns]
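Note that by default duplicated() marks all occurrences after the first; pass keep=False to flag every member of each duplicate group, which is often more useful for visual inspection. A minimal sketch (the seed is arbitrary, just for reproducibility):
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed so the example is reproducible
df = pd.DataFrame(np.random.randint(0, 2, (10, 4)))

# keep=False marks every row that appears more than once,
# including the first occurrence in each duplicate group.
print(df[df.duplicated(keep=False)])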