How to implement:
t = np.where(<exists at least 1 zero in the same column of t>, t, np.zeros_like(t))
in a "pythonic" way?
This code should set a whole column of t to zero if that column contains at least one zero.
Example:
1 1 1 1 1 1
0 1 1 1 1 1
1 1 0 1 0 1
should turn to
0 1 0 1 0 1
0 1 0 1 0 1
0 1 0 1 0 1
any is what you need
~(arr == 0).any(0, keepdims=True) * arr
0 1 0 1 0 1
0 1 0 1 0 1
0 1 0 1 0 1
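For completeness, a runnable version on the example array (the variable name t is mine):
import numpy as np

t = np.array([[1, 1, 1, 1, 1, 1],
              [0, 1, 1, 1, 1, 1],
              [1, 1, 0, 1, 0, 1]])

# columns containing at least one zero are multiplied by False (0), the rest by True (1)
result = ~(t == 0).any(0, keepdims=True) * t
print(result)
# [[0 1 0 1 0 1]
#  [0 1 0 1 0 1]
#  [0 1 0 1 0 1]]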
this code should set a whole column to zero in t if t has at least 1 zero in that column
The simplest way to do this particular task:
t * t.min(0)
A more general way to do it (in case you have an array with different values and the condition is: if a column has at least one occurrence of some_value, then set that column to some_value).
cond = (arr == some_value).any(0)
arr[:, cond] = some_value
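For example, applied to the array above with some_value = 0 (the setup code is mine):
import numpy as np

arr = np.array([[1, 1, 1, 1, 1, 1],
                [0, 1, 1, 1, 1, 1],
                [1, 1, 0, 1, 0, 1]])
some_value = 0

cond = (arr == some_value).any(0)   # True for columns containing some_value
arr[:, cond] = some_value           # overwrite those columns in place
print(arr)
# [[0 1 0 1 0 1]
#  [0 1 0 1 0 1]
#  [0 1 0 1 0 1]]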
I have the following dataframe:
1 2 3 4 5 6 7 8 9
0 0 0 1 0 0 0 0 0 1
1 0 0 0 0 1 1 0 1 0
2 1 1 0 1 1 0 0 1 1
...
I want to get, for each row, the longest sequence of the value 0 in that row.
So the expected result for this dataframe is an array that looks like this:
[5,4,2,...]
since on the first row the maximum sequence of the value 0 has length 5, etc.
I have seen this post and, to start, tried to get this for the first row (though I would like to do it at once for the whole dataframe), but I got errors:
s=df_day.iloc[0]
(~s).cumsum()[s].value_counts().max()
TypeError: ufunc 'invert' not supported for the input types, and the
inputs could not be safely coerced to any supported types according to
the casting rule ''safe''
when I inserted the values manually like this:
s=pd.Series([0,0,1,0,0,0,0,0,1])
(~s).cumsum()[s].value_counts().max()
>>>7
I got 7, which is the total number of 0s in the row but not the max sequence.
However, I don't understand why it raises the error in the first case, and, more importantly, I would like to run it in the end on the whole dataframe, per row.
My end goal: get the maximum uninterrupted occurrence of the value 0 in a row.
Vectorized solution that counts consecutive 0s per row; for the maximum, take the max of DataFrame c:
#more explain https://stackoverflow.com/a/52718619/2901002
m = df.eq(0)                                                # True where the value is 0
b = m.cumsum(axis=1)                                        # running count of zeros per row
c = b.sub(b.mask(m).ffill(axis=1).fillna(0)).astype(int)    # length of the current zero run
print (c)
1 2 3 4 5 6 7 8 9
0 1 2 0 1 2 3 4 5 0
1 1 2 3 4 0 0 1 0 1
2 0 0 1 0 0 1 2 0 0
df['max_consecutive_0'] = c.max(axis=1)
print (df)
1 2 3 4 5 6 7 8 9 max_consecutive_0
0 0 0 1 0 0 0 0 0 1 5
1 0 0 0 0 1 1 0 1 0 4
2 1 1 0 1 1 0 0 1 1 2
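To see what the masking and subtraction do, here is the same idea traced on the first row as a plain Series (a small sketch for illustration):
import pandas as pd

s = pd.Series([0, 0, 1, 0, 0, 0, 0, 0, 1])
m = s.eq(0)                                         # True at zeros
b = m.cumsum()                                      # running count of zeros
c = (b - b.mask(m).ffill().fillna(0)).astype(int)   # length of the current zero run
print(c.tolist())   # [1, 2, 0, 1, 2, 3, 4, 5, 0]
print(c.max())      # 5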
Use:
df = df.T.apply(lambda x: (x != x.shift()).astype(int).cumsum().where(x.eq(0)).dropna().value_counts().max())
Output:
0 5
1 4
2 2
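As for the TypeError in the question: ~ is being applied to an integer Series there, while the linked recipe expects a boolean mask. A minimal sketch of that fix, converting to booleans first:
import pandas as pd

s = pd.Series([0, 0, 1, 0, 0, 0, 0, 0, 1])
mask = s.eq(0)                                        # boolean Series, so ~ is well defined
print((~mask).cumsum()[mask].value_counts().max())   # 5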
The following code should do the job. The function longest_streak counts the lengths of the runs of zeros and returns the maximum, and you can use apply on your df.
from itertools import groupby

def longest_streak(l):
    lst = []
    for n, c in groupby(l):
        num, count = n, sum(1 for i in c)
        if num == 0:
            lst.append((num, count))
    maxx = max([y for x, y in lst])
    return maxx

df.apply(lambda x: longest_streak(x), axis=1)
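A quick sanity check on the first row of the example, using the function defined above:
print(longest_streak([0, 0, 1, 0, 0, 0, 0, 0, 1]))  # 5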
My feature engineering runs over different documents. For some documents some features do not exist, and consequently the sublist consists of only one value, such as the third sublist [0,0,0,0,0]. One-hot encoding this sublist leads to only one column, while the feature lists of other documents are transformed into two columns. Is there any possibility to tell the one-hot encoding to create two columns even if the sublist consists of only one and the same value, and to insert the column in the right spot? The main problem is that the feature dataframes of the different documents end up with different numbers of columns, which makes them not comparable.
import pandas as pd

feature = [[0,0,1,0,0], [1,1,1,0,1], [0,0,0,0,0], [1,0,1,1,1], [1,1,0,1,1], [1,0,1,1,1], [0,1,0,0,0]]

df = pd.DataFrame(feature[0])
df_features_final = pd.get_dummies(df[0])

for feature in feature[1:]:
    df = pd.DataFrame(feature)
    df_enc = pd.get_dummies(df[0])
    print(df_enc)
    df_features_final = pd.concat([df_features_final, df_enc], axis=1, join='inner')

print(df_features_final)
The result is the following dataframe. As you can see from the changing column titles, there is no 1 column after the fifth 0 column (the all-zero sublist):
0 1 0 1 0 0 1 0 1 0 1 0 1
0 1 0 0 1 1 0 1 0 1 0 1 1 0
1 1 0 0 1 1 1 0 0 1 1 0 0 1
2 0 1 0 1 1 0 1 1 0 0 1 1 0
3 1 0 1 0 1 0 1 0 1 0 1 1 0
4 1 0 0 1 1 0 1 0 1 0 1 1 0
I don't see the functionality you want in pandas, at least. But in TensorFlow we do have
tf.one_hot(
indices, depth, on_value=None, off_value=None, axis=None, dtype=None, name=None
)
Set depth to 2.
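For instance, on the all-zero sublist (a sketch assuming TensorFlow 2.x with eager execution):
import tensorflow as tf

# depth=2 forces two columns even though only the value 0 occurs
print(tf.one_hot([0, 0, 0, 0, 0], depth=2).numpy())
# [[1. 0.]
#  [1. 0.]
#  [1. 0.]
#  [1. 0.]
#  [1. 0.]]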
So, I am working with over 100 attributes. Clearly I cannot be using this:
df['column_name'] >= 1 & df['column_name'] <= 1
Say my dataframe looks like this-
A B C D E F G H I
1 1 1 1 1 0 1 1 0
0 0 1 1 0 0 0 0 1
0 0 1 0 0 0 1 1 1
0 1 1 1 1 0 0 0 0
I wish to find the number of instances with value 1 for both labels C and I. The answer here is two (2nd and 3rd rows). I am working with a lot of attributes and certainly cannot hardcode them. How can I find the frequency?
Assume I have access to the list of class labels I wish to work with, i.e. [C, I].
I think you want DataFrame.all:
df[['C','I']].eq(1).all(axis=1).sum()
#2
We can also use:
df[['C','I']].astype(bool).all(axis=1).sum()
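Since the class labels are available as a list, the same pattern works without hardcoding them (a small sketch built from the example data):
import pandas as pd

df = pd.DataFrame({'A': [1, 0, 0, 0], 'B': [1, 0, 0, 1], 'C': [1, 1, 1, 1],
                   'D': [1, 1, 0, 1], 'E': [1, 0, 0, 1], 'F': [0, 0, 0, 0],
                   'G': [1, 0, 1, 0], 'H': [1, 0, 1, 0], 'I': [0, 1, 1, 0]})

labels = ['C', 'I']                          # list of class labels to check
print(df[labels].eq(1).all(axis=1).sum())    # 2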
I have a dataframe in Pandas (subset below).
DATE IN 200D_MA TEST
10/30/2013 0 1 0
10/31/2013 0 1 0
11/1/2013 1 1 1 IN & 200D_MA both =1, results 1
11/4/2013 0 1 1 PREVIOUS TEST ROW =1 & 200D_MA = 1, TEST ans=1
11/5/2013 0 1 1 PREVIOUS TEST ROW =1 & 200D_MA = 1, TEST ans=1
11/6/2013 0 1 1
11/7/2013 0 1 1
11/8/2013 0 1 1
11/11/2013 0 0 0 PREVIOUS TEST ROW =1 & 200D_MA = 0, TEST ans=0
This is easy to do in Excel, so I thought it would be easy to do in Python. I have this code using nested np.where formulas:
df3['TEST'] = np.where((df3['IN'] == 1) & (df3['200D_MA'] == 1), 1,
                       np.where((df3['TEST'].shift(-1) == 1)
                                & (df3['200D_MA'] == 1), 1, 0))
but it throws a KeyError: 'IN', presumably because I am using a condition on a column that has not been created yet. Can anyone help me figure out how to do this?
Seems like you need a conditional ffill:
df['TEST'] = df.loc[df.IN == 1, 'IN']        # seed TEST with 1 where IN == 1, NaN elsewhere
df.loc[df['200D_MA'] == 1, 'TEST'] = df.loc[df['200D_MA'] == 1, 'TEST'].ffill()  # carry the 1 forward while 200D_MA == 1
df.fillna(0, inplace=True)                   # remaining NaNs become 0
df.TEST = df.TEST.astype(int)
df
Out[349]:
DATE IN 200D_MA TEST
0 10/30/2013 0 1 0
1 10/31/2013 0 1 0
2 11/1/2013 1 1 1
3 11/4/2013 0 1 1
4 11/5/2013 0 1 1
5 11/6/2013 0 1 1
6 11/7/2013 0 1 1
7 11/8/2013 0 1 1
8 11/11/2013 0 0 0
I think you can use rolling to calculate previous TEST row.
df['TEST'] = (df['IN 200D_MA'] & df['IN 200D_MA'].rolling(2).min().shift(1)).astype(int)
Output:
DATE IN 200D_MA TEST
10/30/2013 0 1 0
10/31/2013 0 1 0
11/1/2013 1 1 1
11/4/2013 0 1 1
11/5/2013 0 1 1
11/6/2013 0 1 1
11/7/2013 0 1 1
11/8/2013 0 1 1
11/11/2013 0 0 0
What's the best way to do the following in Python/pandas, please?
I want to count the occurrences where trend data 2 steps out of line with trend data 1, and reset the counter each time trend data 1 changes.
I'm struggling with the right way to do it on the dataframe, creating a new column df['D'] in this example.
df['A'] = trend data 1
df['B'] = boolean indicator if trend data 1 changes
df['C'] = trend data 2
df['D'] = desired result
df['A'] df['B'] df['C'] df['D']
1 0 1 0
1 0 1 0
-1 1 -1 0
-1 0 -1 0
-1 0 1 1
-1 0 -1 1
-1 0 -1 1
-1 0 1 2
-1 0 1 2
-1 0 -1 2
1 1 1 0
1 0 1 0
1 0 -1 1
1 0 1 1
1 0 -1 2
1 0 1 2
1 0 1 2
In Excel I would simply use:
=IF(B2=1,0,IF(AND((C2<>C1),(C2<>A2)),D1+1,D1))
However, I've always struggled with not being able to reference prior cells in pandas.
I can't use np.where(). I'm sure it's just a matter of applying a function in the correct way, but I can't seem to make it work while referencing other columns and resetting the variable. I've looked at other answers but can't seem to find anything that works in this situation.
Something like the following (note: first create df['E'] = df['C'].shift(1)):
def corrections(x):
    if df['B'] == 1:
        x = 0
    elif (df['C'] != df['E']) AND (df['C'] != df['A']):
        x = x + 1
    else:
        x
Apologies, as I feel I'm missing something rather simple with this question, but I just keep going round in circles!
def make_D(df):
    counter = 0
    array = []
    for index in df.index:
        if df.loc[index, 'A'] != df.loc[index, 'C']:
            counter = counter + 1
        if index > 0:
            if df.loc[index, 'B'] != df.loc[index - 1, 'B']:
                counter = 0
        array.append(counter)
    df['D'] = array
    return df

new_df = make_D(df)
hope it helps!
#Set a list to store values for column D
d = []
#calculate D using the given conditions
df.apply(lambda x: d.append(0) if ((x.name==0)|(x.B==1)) else d.append(d[-1]+1) if (x.C!=df.iloc[x.name-1].C) & (x.C!=x.A) else d.append(d[-1]), axis=1)
#set column D using values from the list d.
df['D'] = d
Out[594]:
A B C D
0 1 0 1 0
1 1 0 1 0
2 -1 1 -1 0
3 -1 0 -1 0
4 -1 0 1 1
5 -1 0 -1 1
6 -1 0 -1 1
7 -1 0 1 2
8 -1 0 1 2
9 -1 0 -1 2
10 1 1 1 0
11 1 0 1 0
12 1 0 -1 1
13 1 0 1 1
14 1 0 -1 2
15 1 0 1 2
16 1 0 1 2