Creating new column from filtering others - python

I need to assign to a new column the value 1 or 0 depending on what other columns have.
I have around 30 columns with binary values (1 or 0), but also other variables with continuous numeric values (e.g. 200). I would like to avoid writing a logical condition with many ORs, so I was wondering if there is an easy and fast way to do it.
For example, by creating a list of column names and assigning 1 to the new column if at least one of those columns has the value 1 in the corresponding row.
Example:
a1 b1 d4 ....
1 0 1
0 0 1
0 0 0
...
Expected:
a1 b1 d4 .... New
1 0 1 1
0 0 1 1
0 0 0 0
...
Many thanks for your help

Here is a simple solution:
import pandas as pd
df = pd.DataFrame({'a1':[1,0,0,1], 'b1':[0,0,0,1], 'd4':[1,1,0,0], 'num':[12,-2,0,3]})
# True if any listed column is non-zero in that row, then cast to 0/1
df['New'] = df[['a1','b1','d4']].any(axis=1).astype('int')
df
   a1  b1  d4  num  New
0   1   0   1   12    1
1   0   0   1   -2    1
2   0   0   0    0    0
3   1   1   0    3    1
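If the binary columns are not known in advance, one option is to detect them from the data instead of listing them by hand. The following is a minimal sketch, assuming a column counts as binary when every value in it is 0 or 1; the name binary_cols is illustrative and not part of the answer above.

```python
import pandas as pd

df = pd.DataFrame({'a1': [1, 0, 0, 1], 'b1': [0, 0, 0, 1],
                   'd4': [1, 1, 0, 0], 'num': [12, -2, 0, 3]})

# Assumption: a column is "binary" if it contains only 0s and 1s
binary_cols = [c for c in df.columns if df[c].isin([0, 1]).all()]

# 1 if at least one binary column is 1 in that row, else 0
df['New'] = df[binary_cols].any(axis=1).astype(int)
print(df)
```

A continuous column that happens to contain only 0s and 1s would also be picked up, so an explicit list is safer when the column names are known.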

Related

Check if n consecutive elements equals x and any previous element is greater than x

I have a pandas dataframe with 6 mins readings. I want to mark each row as either NF or DF.
NF = rows with 5 consecutive entries being 0 and at least one prior reading being greater than 0
DF = All other rows that do not meet the NF rule
[[4,6,7,2,1,0,0,0,0,0]
[6,0,0,0,0,0,2,2,2,5]
[0,0,0,0,0,0,0,0,0,0]
[0,0,0,0,0,4,6,7,2,1]]
Expected Result:
[NF, NF, DF, DF]
Can I use a sliding window for this? What is a good pythonic way of doing this?
Starting with a vectorised NumPy solution: two conditions evaluated on a truth matrix (df.eq(0)).
It uses the fact that True counts as 1, so cumsum() gives a running count of zeros.
If the first five zeros are consecutive, the position of the 5th zero is exactly 4 places after the 1st.
If you just want the array, np.where() gives that without assigning it back to a dataframe column.
I added another test case, [1,0,0,0,0,1,0,0,0,0], which has many zeros but no run of 5 consecutive zeros.
import numpy as np
import pandas as pd
df = pd.DataFrame([[4,6,7,2,1,0,0,0,0,0],
                   [6,0,0,0,0,0,2,2,2,5],
                   [0,0,0,0,0,0,0,0,0,0],
                   [1,0,0,0,0,1,0,0,0,0],
                   [0,0,0,0,0,4,6,7,2,1]])
df = df.assign(res=np.where(
    # five consecutive zeros: the 5th zero sits 4 places after the 1st
    ((np.argmax(np.cumsum(df.eq(0).values, axis=1)==1, axis=1)+4) ==
     np.argmax(np.cumsum(df.eq(0).values, axis=1)==5, axis=1)) &
    # first zero somewhere other than the 0th position
    # (parenthesised because & binds tighter than >)
    (np.argmax(df.eq(0).values, axis=1)>0)
    ,"NF","DF")
)
   0  1  2  3  4  5  6  7  8  9 res
0  4  6  7  2  1  0  0  0  0  0  NF
1  6  0  0  0  0  0  2  2  2  5  NF
2  0  0  0  0  0  0  0  0  0  0  DF
3  1  0  0  0  0  1  0  0  0  0  DF
4  0  0  0  0  0  4  6  7  2  1  DF
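Since the question asks about a sliding window, here is a minimal per-row sketch of that idea. It assumes NumPy 1.20 or later (for sliding_window_view) and reads the NF rule as "some window of n zeros starts after at least one positive reading"; the function name classify is illustrative. Unlike the cumsum approach above, it also considers zero runs that come after the first one.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def classify(row, n=5):
    row = np.asarray(row)
    if row.size < n:
        return "DF"  # too short to contain n consecutive zeros
    # zero_run[i] is True when row[i:i+n] is all zeros
    zero_run = (sliding_window_view(row, n) == 0).all(axis=1)
    # seen_positive[i] is True when any reading before index i is > 0
    seen_positive = np.concatenate(([False], np.cumsum(row > 0)[:-1] > 0))
    hit = zero_run & seen_positive[:zero_run.size]
    return "NF" if hit.any() else "DF"

rows = [[4, 6, 7, 2, 1, 0, 0, 0, 0, 0],
        [6, 0, 0, 0, 0, 0, 2, 2, 2, 5],
        [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
        [1, 0, 0, 0, 0, 1, 0, 0, 0, 0],
        [0, 0, 0, 0, 0, 4, 6, 7, 2, 1]]
results = [classify(r) for r in rows]
print(results)
```

For these five rows it gives the same labels as the table above.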

How to create by default two columns for every features (One Hot Encoding)?

My feature engineering runs on different documents. For some documents some features do not exist, and consequently the sublist consists only of identical values, such as the third sublist [0,0,0,0,0]. One-hot encoding this sublist produces only one column, while the feature lists of other documents are transformed into two columns. Is there any way to tell the encoder to also create two columns when the list consists of a single repeated value, and to insert the missing column in the right spot? The main problem is that the feature dataframes of different documents end up with different numbers of columns, which makes them not comparable.
import pandas as pd
feature = [[0,0,1,0,0], [1,1,1,0,1], [0,0,0,0,0], [1,0,1,1,1], [1,1,0,1,1], [1,0,1,1,1], [0,1,0,0,0]]
df = pd.DataFrame(feature[0])
df_features_final = pd.get_dummies(df[0])
for feature in feature[1:]:
    df = pd.DataFrame(feature)
    df_enc = pd.get_dummies(df[0])
    print(df_enc)
    df_features_final = pd.concat([df_features_final, df_enc], axis = 1, join ='inner')
print(df_features_final)
The result is the following dataframe. As you can see from the column titles, the 0 in the fifth column is not followed by a 1 column:
0 1 0 1 0 0 1 0 1 0 1 0 1
0 1 0 0 1 1 0 1 0 1 0 1 1 0
1 1 0 0 1 1 1 0 0 1 1 0 0 1
2 0 1 0 1 1 0 1 1 0 0 1 1 0
3 1 0 1 0 1 0 1 0 1 0 1 1 0
4 1 0 0 1 1 0 1 0 1 0 1 1 0
I am not aware of this functionality in pandas, at least. But TensorFlow has tf.one_hot:
tf.one_hot(
indices, depth, on_value=None, off_value=None, axis=None, dtype=None, name=None
)
Set depth to 2.
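Staying within pandas, one possible workaround (not part of the answer above) is to reindex each encoded frame to a fixed set of columns, so that a missing category shows up as an all-zero column. This is a minimal sketch, assuming every feature is binary with categories 0 and 1; the helper encode_binary and the f0_, f1_ prefixes are illustrative names.

```python
import pandas as pd

features = [[0, 0, 1, 0, 0], [1, 1, 1, 0, 1], [0, 0, 0, 0, 0]]

def encode_binary(values, name):
    # One-hot encode, then force both columns (0 and 1) to exist
    dummies = pd.get_dummies(pd.Series(values), dtype=int)
    dummies = dummies.reindex(columns=[0, 1], fill_value=0)
    # Prefix the column labels so features stay distinguishable
    return dummies.add_prefix(f"{name}_")

encoded = pd.concat(
    [encode_binary(v, f"f{i}") for i, v in enumerate(features)], axis=1
)
print(encoded)
```

With this, every feature contributes exactly two columns, so frames built from different documents have the same layout.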

Counting instances in dataframe that match to another instance

So, I am working with over 100 attributes. Clearly I cannot write something like this for each of them:
df['column_name'] >= 1 & df['column_name'] <= 1
Say my dataframe looks like this:
A B C D E F G H I
1 1 1 1 1 0 1 1 0
0 0 1 1 0 0 0 0 1
0 0 1 0 0 0 1 1 1
0 1 1 1 1 0 0 0 0
I wish to count the instances (rows) that have the value 1 for both labels C and I. The answer here is two (the 2nd and 3rd rows). I am working with a lot of attributes, so I certainly cannot hardcode them. How can I find this frequency?
Assume I have access to the list of class labels I wish to work with, i.e. [C, I].
I think you want DataFrame.all:
df[['C','I']].eq(1).all(axis=1).sum()
#2
We can also use:
df[['C','I']].astype(bool).all(axis=1).sum()
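As a quick check, the first expression can be run against the dataframe from the question. This sketch only rebuilds that example data; the variable names labels and count are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [[1, 1, 1, 1, 1, 0, 1, 1, 0],
     [0, 0, 1, 1, 0, 0, 0, 0, 1],
     [0, 0, 1, 0, 0, 0, 1, 1, 1],
     [0, 1, 1, 1, 1, 0, 0, 0, 0]],
    columns=list('ABCDEFGHI'))

labels = ['C', 'I']  # the list of class labels to test

# Rows where every selected label equals 1, counted via sum of booleans
count = int(df[labels].eq(1).all(axis=1).sum())
print(count)  # 2
```

Because labels is an ordinary list, the same line works unchanged for any number of attributes.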

Create difference columns from one hot encoded columns

I'm trying to create some extra features on a data set. I want to get a spatial context from the features I already have one hot encoded. So for example, I have this:
F1 F2 F3 F4
1 0 1 1 0
2 1 0 1 1
3 1 0 0 0
4 0 0 0 1
I want to create some new columns against the values here:
F1 F2 F3 F4 S1 S2 S3 S4
1 0 1 1 0 0 2 1 0
2 1 0 0 1 1 0 0 3
3 1 0 0 0 1 0 0 0
4 0 0 0 1 0 0 0 4
I'm hoping there is an easy way to do this, to calculate changes from the last value of the column and output that to a corresponding column. Any help is appreciated, thanks.
You could do:
import numpy as np
import pandas as pd
def func(x):
    # create result array
    result = np.zeros(x.shape, dtype=int)
    # get the indices of the non-zero entries
    w = np.argwhere(x).ravel()
    # compute the difference between consecutive indices and add the first index + 1
    array = np.hstack(([w[0] + 1], np.ediff1d(w)))
    # set the values on result
    np.put(result, w, array)
    return result
columns = ['S{}'.format(i) for i in range(1, 5)]
s = pd.DataFrame(df.ne(0).apply(func, axis=1).values.tolist(),
                 columns=columns)
result = pd.concat([df, s], axis=1)
print(result)
Output
F1 F2 F3 F4 S1 S2 S3 S4
0 0 1 1 0 0 2 1 0
1 1 0 0 1 1 0 0 3
2 1 0 0 0 1 0 0 0
3 0 0 0 1 0 0 0 4
Note that you need to import numpy (import numpy as np) for func to work. The idea is to find the indices of the non-zero entries, compute the difference between consecutive indices, set the first value to its index + 1, and do this for each row.
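The same idea can be sketched a little more compactly with np.diff and its prepend argument, which also makes it easy to handle a row with no non-zero entries. This is an illustrative variant, not the answer's code: the function name spacing is hypothetical, and the data is the dataframe shown in the output above.

```python
import numpy as np
import pandas as pd

def spacing(row):
    values = np.asarray(row)
    result = np.zeros(values.shape, dtype=int)
    idx = np.flatnonzero(values)  # positions of the non-zero entries
    if idx.size:
        # Gap to the previous non-zero entry; prepend=-1 makes the
        # first gap equal to (first index + 1)
        result[idx] = np.diff(idx, prepend=-1)
    return result

df = pd.DataFrame({'F1': [0, 1, 1, 0], 'F2': [1, 0, 0, 0],
                   'F3': [1, 0, 0, 0], 'F4': [0, 1, 0, 1]})

s = pd.DataFrame(np.apply_along_axis(spacing, 1, df.to_numpy()),
                 columns=['S1', 'S2', 'S3', 'S4'], index=df.index)
result = pd.concat([df, s], axis=1)
print(result)
```

An all-zero row simply yields all zeros here, whereas indexing w[0] on an empty array would raise an IndexError.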

Python Selecting and Adding row values of columns in dataframe to create an aggregated dataframe

I need to process my dataframe in Python such that I add the numeric values of numeric columns that lie between 2 rows of the dataframe.
The dataframe can be created using
df = pd.DataFrame(np.array([['a',0,1,0,0,0,0,'i'],
['b',1,0,0,0,0,0,'j'],
['c',0,0,1,0,0,0,'k'],
['None',0,0,0,1,0,0,'l'],
['e',0,0,0,0,1,0,'m'],
['f',0,1,0,0,0,0,'n'],
['None',0,0,0,1,0,0,'o'],
['h',0,0,0,0,1,0,'p']]),
columns=[0,1,2,3,4,5,6,7],
index=[0,1,2,3,4,5,6,7])
I need to add all rows that occur before the 'None' entries and move the aggregated row to a new dataframe that should look like:
Your dataframe's dtypes are messed up because you build it from a NumPy array: an array holds a single type, so all the ints are converted to strings. We need to convert them back first.
df = df.apply(pd.to_numeric, errors='ignore')  # convert numeric-looking columns back to numbers
df['newkey'] = df[0].eq('None').cumsum()  # cumsum over the 'None' markers creates the group key
df.loc[df[0].ne('None'), :].groupby('newkey').agg(lambda x: x.sum() if x.dtype == 'int64' else x.head(1))  # then aggregate
Out[742]:
0 1 2 3 4 5 6 7
newkey
0 a 1 1 1 0 0 0 i
1 e 0 1 0 0 1 0 m
2 h 0 0 0 0 1 0 p
You can also specify the agg funcs
s = lambda s: sum(int(k) for k in s)
d = {i: s for i in range(8)}
d.update({0: 'first', 7: 'first'})
df.groupby((df[0] == 'None').cumsum().shift().fillna(0)).agg(d)
0 1 2 3 4 5 6 7
0
0.0 a 1 1 1 1 0 0 i
1.0 e 0 1 0 1 1 0 m
2.0 h 0 0 0 0 1 0 p
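The conversion step can also be written with explicit column types instead of errors='ignore', which pandas has deprecated since version 2.2. The following is a minimal sketch, assuming columns 1 to 6 are the numeric ones and that the 'None' rows act only as separators (as in the first answer); num_cols, key, keep and agg_spec are illustrative names.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.array([['a', 0, 1, 0, 0, 0, 0, 'i'],
                            ['b', 1, 0, 0, 0, 0, 0, 'j'],
                            ['c', 0, 0, 1, 0, 0, 0, 'k'],
                            ['None', 0, 0, 0, 1, 0, 0, 'l'],
                            ['e', 0, 0, 0, 0, 1, 0, 'm'],
                            ['f', 0, 1, 0, 0, 0, 0, 'n'],
                            ['None', 0, 0, 0, 1, 0, 0, 'o'],
                            ['h', 0, 0, 0, 0, 1, 0, 'p']]),
                  columns=list(range(8)))

# Assumption: columns 1..6 hold the numeric values (stored as strings)
num_cols = [1, 2, 3, 4, 5, 6]
df = df.astype({c: int for c in num_cols})

# Each 'None' row starts a new group; the separator rows are dropped
key = df[0].eq('None').cumsum().rename('newkey')
keep = df[0].ne('None')

# Sum the numeric columns, keep the first label of columns 0 and 7
agg_spec = {0: 'first', **{c: 'sum' for c in num_cols}, 7: 'first'}
result = df[keep].groupby(key[keep]).agg(agg_spec)
print(result)
```

Because the separator rows are excluded, column 4 sums to 0 in every group, matching the first answer's output rather than the second's.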
