Python: combine boolean columns in Pandas dataframes

I have the following data
attr1_A  attr1_B  attr1_C  attr1_D  attr2_A  attr2_B  attr2_C
1        0        0        1        1        0        0
0        1        1        0        0        0        1
0        0        0        0        0        1        0
1        1        1        0        1        1        0
I want to retain attr1_A and attr1_B, and combine attr1_C and attr1_D into attr1_others. As long as attr1_C and/or attr1_D is 1, attr1_others will be 1. Similarly, I want to keep attr2_A but combine the remaining attr2_* columns into attr2_others. Like this:
attr1_A  attr1_B  attr1_others  attr2_A  attr2_others
1        0        1             1        0
0        1        1             0        1
0        0        0             0        1
1        1        1             1        1
In other words, for each group of attr columns, I want to retain a few known columns and combine the remaining ones (I don't know in advance how many remaining columns each group has).
I am thinking of doing each group separately, processing all attr1_* and then all attr2_*, because my dataset has a limited number of groups but many attributes under each group.
What I can think of right now is to retrieve the others columns like:
# for group 1
df[[x for x in df.columns if "A" not in x and "B" not in x and "attr1_" in x]]
# for group 2
df[[x for x in df.columns if "A" not in x and "attr2_" in x]]
And to combine them, I am thinking of using the any() function, but I can't come up with the syntax. Could you help?
Updated attempt:
I tried this
# for group 1
df['attr1_others'] = df[df[[x for x in list(df.columns)
                            if "attr1_" in x
                            and "A" not in x
                            and "B" not in x]].any(axis = 'column')]
but got the below error:
ValueError: No axis named column for object type <class 'pandas.core.frame.DataFrame'>

Dataframes make it easy to manipulate data in place, without having to write complex Python logic.
To create your attr1_others and attr2_others columns, you can combine the source columns with an or condition like this:
df['attr1_others'] = df['attr1_C'] | df['attr1_D']
df['attr2_others'] = df['attr2_B'] | df['attr2_C']
If instead, you wanted an and condition, you could use:
df['attr1_others'] = df['attr1_C'] & df['attr1_D']
df['attr2_others'] = df['attr2_B'] & df['attr2_C']
You can then delete the lingering original values using del:
del df['attr1_C']
del df['attr1_D']
del df['attr2_B']
del df['attr2_C']
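As a side note, since these columns hold 0/1 integers, | performs a bitwise OR and the result keeps the integer dtype, so no cast is needed. The four del statements could also be collapsed into a single drop call; a minimal sketch:
# drop all four combined-away source columns at once
df = df.drop(columns=['attr1_C', 'attr1_D', 'attr2_B', 'attr2_C'])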

Create a list of kept columns. Drop those columns and assign the left-over columns to a new dataframe df1. Group df1's columns by their split name prefix, call any along axis=1, add the suffix '_others', and assign the result to df2. Finally, join and sort_index.
keep_cols = ['attr1_A', 'attr1_B', 'attr2_A']
df1 = df.drop(keep_cols, axis=1)
df2 = (df1.groupby(df1.columns.str.split('_').str[0], axis=1)
          .any().add_suffix('_others').astype(int))
Out[512]:
attr1_others attr2_others
0 1 0
1 1 1
2 0 1
3 1 1
df_final = df[keep_cols].join(df2).sort_index(axis=1)
Out[514]:
attr1_A attr1_B attr1_others attr2_A attr2_others
0 1 0 1 1 0
1 0 1 1 0 1
2 0 0 0 0 1
3 1 1 1 1 1
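Note that positional axis arguments were removed in pandas 2.0 and groupby(axis=1) is deprecated in recent releases. A sketch of the same approach in current syntax, grouping over a transpose instead of on columns (assuming the same df and keep_cols):
keep_cols = ['attr1_A', 'attr1_B', 'attr2_A']
df1 = df.drop(columns=keep_cols)
# .T flips the columns into the index so a plain groupby can be used,
# then .T flips the aggregated result back
df2 = (df1.T.groupby(df1.columns.str.split('_').str[0]).any().T
          .add_suffix('_others').astype(int))
df_final = df[keep_cols].join(df2).sort_index(axis=1)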

You can use a custom list to select columns, and then .any() with the axis=1 parameter. To convert to integer, use .astype(int).
For example:
import pandas as pd
df = pd.DataFrame({
    'attr1_A': [1, 0, 0, 1],
    'attr1_B': [0, 1, 0, 1],
    'attr1_C': [0, 1, 0, 1],
    'attr1_D': [1, 0, 0, 0],
    'attr2_A': [1, 0, 0, 1],
    'attr2_B': [0, 0, 1, 1],
    'attr2_C': [0, 1, 0, 0]})
cols = [col for col in df.columns.values if col.startswith('attr1') and col.split('_')[1] not in ('A', 'B')]
df['attr1_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)
cols = [col for col in df.columns.values if col.startswith('attr2') and col.split('_')[1] not in ('A', )]
df['attr2_others'] = df[cols].any(axis=1).astype(int)
df.drop(cols, axis=1, inplace=True)
print(df)
Prints:
attr1_A attr1_B attr2_A attr1_others attr2_others
0 1 0 1 1 0
1 0 1 0 1 1
2 0 0 0 0 1
3 1 1 1 1 1
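Since the two blocks above are nearly identical, they could be folded into a small helper; a minimal sketch (combine_others is a hypothetical name, assuming the kept columns are known per group):
def combine_others(df, prefix, keep):
    # collapse every `prefix` column not in `keep` into a single `<prefix>_others` column
    cols = [c for c in df.columns if c.startswith(prefix) and c not in keep]
    df[prefix + '_others'] = df[cols].any(axis=1).astype(int)
    return df.drop(columns=cols)

df = combine_others(df, 'attr1', keep=['attr1_A', 'attr1_B'])
df = combine_others(df, 'attr2', keep=['attr2_A'])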

Related

Find the rows that share the value

I need to find the rows where columns A, B and C all have the value 1 and then create a new column that holds the result.
My idea is to use np.where() with some condition, but I don't know the correct way of dealing with this problem. From what I have read, I'm not supposed to iterate through a dataframe but use some of pandas' built-in methods?
df1 = pd.DataFrame({'A': [0, 1, 1, 0],
                    'B': [1, 1, 0, 1],
                    'C': [0, 1, 1, 1]},
                   index=[0, 1, 2, 4])
print(df1)
what I am after is this:
A B C TRUE
0 0 1 0 0
1 1 1 1 1 <----
2 1 0 1 0
4 0 1 1 0
If the data is always 0/1, you can simply take the product per row:
df1['TRUE'] = df1.prod(axis=1)
output:
A B C TRUE
0 0 1 0 0
1 1 1 1 1
2 1 0 1 0
4 0 1 1 0
This is what you are looking for:
df1["TRUE"] = (df1==1).all(axis=1).astype(int)

Add another column based on the value of two columns

I am trying to add another column based on the value of two columns. Here is the mini version of my dataframe.
data = {'current_pair': ['"["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]"', '"["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]"', '"["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]"', '"["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]"', '"["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]"'],
        'B': [1, 0, 1, 1, 0]}
df = pd.DataFrame(data)
df
current_pair B
0 "["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]" 1
1 "["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]" 0
2 "["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]" 1
3 "["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]" 1
4 "["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]" 0
I want the result to be:
current_pair B C
0 "["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]" 1 1
1 "["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]" 0 2
2 "["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]" 1 0
3 "["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]" 1 1
4 "["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]" 0 2
I used the numpy select commands:
conditions = [(data['B']==1 & data['current_pair'].str.contains('Emo/', na=False)),
              (data['B']==1 & data['current_pair'].str.contains('Neu/', na=False)),
              data['B']==0]
choices = [0, 1, 2]
data['C'] = np.select(conditions, choices, default=np.nan)
Unfortunately, it gives me this dataframe, never assigning "1" in column "C":
current_pair B C
0 "["StimusNeu/2357.jpg","StimusNeu/5731.jpg"]" 1 0
1 "["StimusEmo/6350.jpg","StimusEmo/3230.jpg"]" 0 2
2 "["StimusEmo/3215.jpg","StimusEmo/9570.jpg"]" 1 0
3 "["StimusNeu/7020.jpg","StimusNeu/7547.jpg"]" 1 0
4 "["StimusNeu/7080.jpg","StimusNeu/7179.jpg"]" 0 2
Any help counts! thanks a lot.
There is a problem with operator precedence: & binds more tightly than ==, so each comparison needs its own parentheses:
conditions = [(data['B']==1) & data['current_pair'].str.contains('Emo/', na=False),
              (data['B']==1) & data['current_pair'].str.contains('Neu/', na=False),
              data['B']==0]
I think some logic went wrong here; this works:
df.assign(C=np.select([df.B==0,
                       df.current_pair.str.contains('Emo/'),
                       df.current_pair.str.contains('Neu/')],
                      [2, 0, 1]))
Here is a slightly more generalized suggestion, easily applicable to more complex cases. You should, however, mind the execution speed:
import pandas as pd
df = pd.DataFrame({'col_1': ['Abc', 'Xcd', 'Afs', 'Xtf', 'Aky'], 'col_2': [1, 2, 3, 4, 5]})
def someLogic(col_1, col_2):
    if 'A' in col_1 and col_2 == 1:
        return 111
    elif "X" in col_1 and col_2 == 4:
        return 999
    return 888

df['NewCol'] = df.apply(lambda row: someLogic(row.col_1, row.col_2), axis=1, result_type="expand")
print(df)
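On execution speed: a row-wise apply calls the Python function once per row, so on larger frames the vectorized np.select answer above will generally be much faster.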

Replace the max value for each column to 0 in Pandas

For example, I have a data set of this:
data = {
    "A": [1, 2, 3],
    "B": [3, 5, 1],
    "C": [9, 0, 1]
}
data_df = pd.DataFrame(data)
data_df
A B C
0 1 3 9
1 2 5 0
2 3 1 1
I want to replace the max value of each column with 0. My desired output is:
A B C
0 1 3 0
1 2 0 0
2 0 1 1
Thank you in advance!
You can iterate through the columns, get each column's max value, and replace the cells that hold it:
for col in data_df.columns:
    data_df[col] = data_df[col].apply(lambda x: 0 if x == data_df[col].max() else x)
This works if your max value is unique. Just be aware that idxmax() returns the first index of the maximum value; if the value occurs more than once, the other occurrences won't be replaced.
for col in df.columns:
    df.loc[df.idxmax()[col], col] = 0
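For reference, a vectorized sketch that zeroes every occurrence of each column's max, ties included, assuming the data_df from above:
# df.eq(df.max()) aligns the per-column max Series against the columns,
# marking every cell equal to its column's maximum
data_df = data_df.mask(data_df.eq(data_df.max()), 0)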

Replace values in each cell based on other rows using lambda and apply in Python

I'm trying to replace the values in each cell with 1 if the value is equal to the highest value in that row.
This is the data I have
This is where I want to get to
This is what I tried so far:
df_ref['max'] = df_ref.max(axis=1)
df_ref['col1'] = df_ref.col1.apply(lambda x:1 if (x==df_ref['max']) else 0)
Thanks in advance
You're almost there. You don't need the max column; just compute it inside your lambda function and use .any(). You also need to run the process in a loop over the columns:
import pandas as pd
#data
d = {'col1': [0, 1, 0.170531, 0.170533, 0.170531],
     'col2': [0, 0, 0.005285, 0.005285, 0.005285],
     'col3': [0, 0, 0.047557, 0.047557, 0.047557],
     'col4': [1, 0, 0.482381, 0.003104, 0.482381],
     'col5': [0, 0, 0.003104, 0.482458, 0.003104],
     'col6': [0, 0, 0.001109, 0.001108, 0.001109]}
#create dataframe
df = pd.DataFrame(data=d)
#list of columns
columns = df.columns.tolist()
#loop over columns
for col in columns:
    #change to 1 if the value equals the max in that row
    df[col] = df[col].apply(lambda x: 1 if (x == df.max(axis=1)).any() else 0)
print(df)
col1 col2 col3 col4 col5 col6
0 0 0 0 1 0 0
1 1 0 0 0 0 0
2 0 0 0 1 0 0
3 0 0 0 0 1 0
4 0 0 0 1 0 0
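One caveat: (x == df.max(axis=1)).any() marks a cell whenever it equals any row's maximum, not just its own row's, and the loop also mutates df while later iterations recompute the max. A vectorized sketch that compares each cell only to the maximum of its own row, assuming the df from above:
# eq(..., axis=0) broadcasts the per-row max down each column
df = df.eq(df.max(axis=1), axis=0).astype(int)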

Group values in intervals

I have a pandas series containing zeros and ones:
df1 = pd.Series([ 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0])
df1
Out[3]:
0 0
1 0
2 0
3 0
4 0
5 1
6 1
7 1
8 0
9 0
10 0
I would like to create a dataframe df2 that contains the start and the end of intervals with the same value, together with the value associated... df2 in this case should be...
df2
Out[5]:
Start End Value
0 0 4 0
1 5 7 1
2 8 10 0
My attempt was:
from operator import itemgetter
from itertools import groupby
a = [next(group) for key, group in groupby(enumerate(df1), key=itemgetter(1))]
df2 = pd.DataFrame(a, columns=['Start','Value'])
but I don't know how to get the 'End' indices
You can group by a Series created by comparing df1 with its shifted self and taking the cumulative sum.
Then apply a custom function and finally reshape with unstack.
s = df1.ne(df1.shift()).cumsum()
df2 = (df1.groupby(s)
          .apply(lambda x: pd.Series([x.index[0], x.index[-1], x.iat[0]],
                                     index=['Start','End','Value']))
          .unstack().reset_index(drop=True))
print (df2)
Start End Value
0 0 4 0
1 5 7 1
2 8 10 0
Another solution aggregates with agg using first and last, but more code is necessary to shape the result into the desired output.
s = df1.ne(df1.shift()).cumsum()
d = {'first':'Start', 'last':'End'}
df2 = df1.reset_index(name='Value') \
         .groupby([s, 'Value'])['index'] \
         .agg(['first','last']) \
         .reset_index(level=0, drop=True) \
         .reset_index() \
         .rename(columns=d) \
         .reindex(columns=['Start','End','Value'])
print (df2)
Start End Value
0 0 4 0
1 5 7 1
2 8 10 0
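With named aggregation (available since pandas 0.25), the same idea can be written more compactly; a sketch, assuming the df1 from above:
s = df1.ne(df1.shift()).cumsum()
# group by the block id and take the first/last index and value of each block
df2 = (df1.rename('Value').reset_index()
          .groupby(s.values)
          .agg(Start=('index', 'first'), End=('index', 'last'), Value=('Value', 'first'))
          .reset_index(drop=True))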
You could use the pd.Series.diff() method so as to identify the starting indexes:
df2 = pd.DataFrame()
df2['Start'] = df1[df1.diff().fillna(1) != 0].index
Then compute end indexes from this:
df2['End'] = [e - 1 for e in df2['Start'][1:]] + [df1.index.max()]
And finally gather the associated values :
df2['Value'] = df1[df2['Start']].values
output
Start End Value
0 0 4 0
1 5 7 1
2 8 10 0
The thing you are looking for is getting the first and last values in a groupby:
import pandas as pd

def first_last(df):
    return df.iloc[[0, -1]]

df = pd.DataFrame([3]*4 + [4]*4 + [1]*4 + [3]*3, columns=['value'])
print(df)
df['block'] = (df.value.shift(1) != df.value).astype(int).cumsum()
df = df.reset_index().groupby(['block','value'])['index'].agg(['first','last']).reset_index()
del df['block']
print(df)
You can group by shift and cumsum and take the first and last index of each block:
import numpy as np
df2 = df1.groupby((df1 != df1.shift()).cumsum()).apply(
    lambda x: np.ravel([x.index[0], x.index[-1], x.unique()]))
df2 = pd.DataFrame(df2.values.tolist()).rename(columns={0: 'Start', 1: 'End', 2: 'Value'})
You get
Start End Value
0 0 4 0
1 5 7 1
2 8 10 0
