Using two different data frames to compute new variable - python

I have two dataframes of the same dimensions that look like:
df1
ID flag
0 1
1 0
2 1
df2
ID flag
0 0
1 1
2 0
In both dataframes I want to create a new variable that denotes an additive flag. So the new variable will look like this:
df1
ID flag new_flag
0 1 1
1 0 1
2 1 1
df2
ID flag new_flag
0 0 1
1 1 1
2 0 1
So if either flag columns is a 1 the new flag will be a 1.
I tried this code:
df1['new_flag']= 1
df2['new_flag']= 1
df1['new_flag'][(df1['flag']==0)&(df1['flag']==0)]=0
df2['new_flag'][(df2['flag']==0)&(df2['flag']==0)]=0
I would expect the same number of 1 in both new_flag but they differ. Is this because I'm not going row by row? Like this question?
pandas create new column based on values from other columns
If so how do I include criteria from both datafrmes?

You can use np.logical_or to achieve this, if we set df1 to be all 0's except for the last row so we don't just get a column of 1's, we can cast the result of np.logical_or using astype(int) to convert the boolean array to 1 and 0:
In [108]:
df1['new_flag'] = np.logical_or(df1['flag'], df2['flag']).astype(int)
df2['new_flag'] = np.logical_or(df1['flag'], df2['flag']).astype(int)
df1
Out[108]:
ID flag new_flag
0 0 0 0
1 1 0 1
2 2 1 1
In [109]:
df2
Out[109]:
ID flag new_flag
0 0 0 0
1 1 1 1
2 2 0 1

Related

How to remove columns if the value of one specific row is 0

I have a fairly straight forward question but I could not find the answer on stack.
I have a pd.df
Index A B C
0 1 1 0
1 0 0 0
2 1 1 1
3 0 0 1
I simply wish to remove all columns where the fourth row (3) is 0. So only column C would remain. Cheers.
Assuming "Index" the index, you can use boolean indexing:
df2 = df.loc[:, df.iloc[3].ne(0)]
output:
C
0 0
1 0
2 1
3 1
output of df.iloc[3].ne(0):
A False
B False
C True
Name: 3, dtype: bool

pandas: replace values in column with the last character in the column name

I have a dataframe as follows:
import pandas as pd
df = pd.DataFrame({'sent.1':[0,1,0,1],
'sent.2':[0,1,1,0],
'sent.3':[0,0,0,1],
'sent.4':[1,1,0,1]
})
I am trying to replace the non-zero values with the 5th character in the column names (which is the numeric part of the column names), so the output should be,
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
I have tried the following but it does not work,
print(df.replace(1, pd.Series([i[5] for i in df.columns], [i[5] for i in df.columns])))
However when I replace it with column name, the above code works, so I am not sure which part is wrong.
print(df.replace(1, pd.Series(df.columns, df.columns)))
Since you're dealing with 1's and 0's, you can actually just use multiply the dataframe by a range:
df = df * range(1, df.shape[1] + 1)
Output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
Or, if you want to take the numbers from the column names:
df = df * df.columns.str.split('.').str[-1].astype(int)
you could use string multiplication on a boolean array to place the strings based on the condition, and where to restore the zeros:
mask = df.ne(0)
(mask*df.columns.str[5]).where(mask, 0)
To have integers:
mask = df.ne(0)
(mask*df.columns.str[5].astype(int))
output:
sent.1 sent.2 sent.3 sent.4
0 0 0 0 4
1 1 2 0 4
2 0 2 0 0
3 1 0 3 4
And another one, working with an arbitrary condition (here s.ne(0)):
df.apply(lambda s: s.mask(s.ne(0), s.name.rpartition('.')[-1]))

How to create by default two columns for every features (One Hot Encoding)?

My feature engineering runs for different documents. For some documents some features do not exist and followingly the sublist consists only of the same values such as the third sublist [0,0,0,0,0]. One hot encoding of this sublist leads to only one column, while the feature lists of other documents are transformed to two columns. Is there any possibility to tell ohe also to create two columns if it consits only of one and the same value and insert the column in the right spot? The main problem is that my feature dataframe of different documents consists in the end of a different number of columns, which make them not comparable.
import pandas as pd
feature = [[0,0,1,0,0], [1,1,1,0,1], [0,0,0,0,0], [1,0,1,1,1], [1,1,0,1,1], [1,0,1,1,1], [0,1,0,0,0]]
df = pd.DataFrame(feature[0])
df_features_final = pd.get_dummies(df[0])
for feature in feature[1:]:
df = pd.DataFrame(feature)
df_enc = pd.get_dummies(df[0])
print(df_enc)
df_features_final = pd.concat([df_features_final, df_enc], axis = 1, join ='inner')
print(df_features_final)
The result is the following dataframe. As you can see in the changing columntitles, after column 5 does not follow a 1:
0 1 0 1 0 0 1 0 1 0 1 0 1
0 1 0 0 1 1 0 1 0 1 0 1 1 0
1 1 0 0 1 1 1 0 0 1 1 0 0 1
2 0 1 0 1 1 0 1 1 0 0 1 1 0
3 1 0 1 0 1 0 1 0 1 0 1 1 0
4 1 0 0 1 1 0 1 0 1 0 1 1 0
I don't notice the functionality you want in pandas atleast. But, in TensorFlow, we do have
tf.one_hot(
indices, depth, on_value=None, off_value=None, axis=None, dtype=None, name=None
)
Set depth to 2.

Python Selecting and Adding row values of columns in dataframe to create an aggregated dataframe

I need to process my dataframe in Python such that I add the numeric values of numeric columns that lie between 2 rows of the dataframe.
The dataframe can be created using
df = pd.DataFrame(np.array([['a',0,1,0,0,0,0,'i'],
['b',1,0,0,0,0,0,'j'],
['c',0,0,1,0,0,0,'k'],
['None',0,0,0,1,0,0,'l'],
['e',0,0,0,0,1,0,'m'],
['f',0,1,0,0,0,0,'n'],
['None',0,0,0,1,0,0,'o'],
['h',0,0,0,0,1,0,'p']]),
columns=[0,1,2,3,4,5,6,7],
index=[0,1,2,3,4,5,6,7])
I need to add all rows that occur before the 'None' entries and move the aggregated row to a new dataframe that should look like:
Your data frame dtype is mess up , cause you are using the array to assign the value , since one array only accpet one type , so it push all int to become string , we need convert it firstly
df=df.apply(pd.to_numeric,errors ='ignore')# convert
df['newkey']=df[0].eq('None').cumsum()# using cumsum create the key
df.loc[df[0].ne('None'),:].groupby('newkey').agg(lambda x : x.sum() if x.dtype=='int64' else x.head(1))# then we agg
Out[742]:
0 1 2 3 4 5 6 7
newkey
0 a 1 1 1 0 0 0 i
1 e 0 1 0 0 1 0 m
2 h 0 0 0 0 1 0 p
You can also specify the agg funcs
s = lambda s: sum(int(k) for k in s)
d = {i: s for i in range(8)}
d.update({0: 'first', 7: 'first'})
df.groupby((df[0] == 'None').cumsum().shift().fillna(0)).agg(d)
0 1 2 3 4 5 6 7
0
0.0 a 1 1 1 1 0 0 i
1.0 e 0 1 0 1 1 0 m
2.0 h 0 0 0 0 1 0 p

Create new column with max value with groupby

From the following dataframe, I am trying to add a new column, with the condition that for every id check the maximium value. Then place the maximium value for each row of every id in the new column.
df
id value
1 0
1 0
1 0
2 0
2 1
3 1
3 1
Expected result:
id value new_column
1 0 0
1 0 0
1 0 0
2 0 1
2 1 1
3 1 1
3 1 1
I have tried:
df['new_column'] = df.groupby(['id'])['value'].idxmax()
Or:
df['new_column'] = df.groupby(['id'])['value'].max()
But neither of these give the desired result.
You need to use transform for this:
df['new_column'] = df.groupby(['id'])['value'].transform('max')
This more succinctly replicates the following:
df['new_column'] = df['id'].map(df.groupby(['id'])['value'].max())
Remember that the result of a groupby operation is a series with index set to grouper column(s).
Since indices are not aligned between your original dataframe and the groupby object, assignment will not happen automatically.

Categories