add multiple columns programmatically from individual criteria/rules - python

I would like to add multiple columns programmatically to a dataframe using pre-defined rules. As an example, I would like to add 3 columns to the below dataframe, based on whether or not the rows satisfy the three rules indicated in the code below:
# define dataframe
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"time1": [0, 1, 1, 0, 0],
                    "time2": [1, 0, 0, 0, 1],
                    "time3": [0, 0, 0, 1, 0],
                    "outcome": [1, 0, 0, 1, 0]})
#define "rules" for adding subsequent columns
rule_1 = (df1["time1"] == 1)
rule_2 = (df1["time2"] == 1)
rule_3 = (df1["time3"] == 1)
#add new columns based on whether or not above rules are satisfied
df1["rule_1"] = np.where(rule_1, 1, 0)
df1["rule_2"] = np.where(rule_2, 1, 0)
df1["rule_3"] = np.where(rule_3, 1, 0)
As you can see, my approach gets tedious when I need to add tens of columns, each based on a different "rule", to a test dataframe.
Is there a way to do this more easily, without defining each column manually along with its individual np.where clause? I tried something like this, but pandas does not accept it:
rules = [rule_1, rule_2, rule_3]
for rule in rules:
    df1[rule] = np.where(rule, 1, 0)
Any ideas on how to make my approach more programmatically efficient?

Your attempt doesn't work because you are using the rule itself (a boolean Series) as the new column name. I would solve it like this:
rules = [rule_1, rule_2, rule_3]
for i, rule in enumerate(rules):
    df1[f'rule_{i+1}'] = np.where(rule, 1, 0)
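If you prefer to keep the rule names attached to the rules themselves, a dict works just as well here (a sketch reusing the rule definitions from the question):
rules = {"rule_1": df1["time1"] == 1,
         "rule_2": df1["time2"] == 1,
         "rule_3": df1["time3"] == 1}
for name, rule in rules.items():
    df1[name] = np.where(rule, 1, 0)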

Leverage Python's f-strings in a for loop; they are good at this.
# Create a list by filtering the time columns
cols = list(df1.filter(regex='time', axis=1).columns)
# Iterate through the list of columns, imposing your condition with np.where
for col in cols:
    df1[f'{col}_new'] = df1[col].apply(lambda x: np.where(x == 1, 1, 0))
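Note that the per-element apply can also be replaced by a vectorized comparison, which is usually faster on larger frames (a sketch of the same idea):
for col in cols:
    df1[f'{col}_new'] = (df1[col] == 1).astype(int)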

I might be oversimplifying your rules, but something like:
rules = [
    ('time1', 1),
    ('time2', 1),
    ('time3', 1),
]
for i, (col, val) in enumerate(rules):
    df1[f"rule_{i + 1}"] = np.where(df1[col] == val, 1, 0)

If all of your rules check the same thing, this might help: unstack the relevant columns, check the condition on the resulting Series, convert to int, then swaplevel and unstack back into a DataFrame:
df1[['rule1','rule2','rule3']] = df1[['time1','time2','time3']].unstack().eq(1).astype(int).swaplevel().unstack()
Output:
time1 time2 time3 outcome rule1 rule2 rule3
0 0 1 0 1 0 1 0
1 1 0 0 0 1 0 0
2 1 0 0 0 1 0 0
3 0 0 1 1 0 0 1
4 0 1 0 0 0 1 0
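Since every rule here is the same eq(1) check, the round trip through unstack can arguably be skipped; assigning via to_numpy() avoids column alignment (a sketch of an equivalent shortcut):
df1[['rule1', 'rule2', 'rule3']] = df1[['time1', 'time2', 'time3']].eq(1).astype(int).to_numpy()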

Related

summing rows based on one-hot variables

I think the code below works, but it seems too clumsy. Basically, I want to go from the haves dataframe below to the wants output: for each dummy column, sum the Result values of the rows where that dummy is 1. Hope this makes sense?
data = {'Dummy1': [0, 0, 1, 1],
        'Dummy2': [1, 1, 0, 0],
        'Result': [1, 1, 2, 2]}
haves = pd.DataFrame(data)
print(haves)
melted = pd.melt(haves, id_vars=['Result'])
melted = melted.loc[melted["value"] > 0]
print(melted)
wants = melted.groupby(["variable"])["Result"].sum()
print(wants)
No need to melt; perform a simple multiplication and sum:
wants = haves.drop('Result', axis=1).mul(haves['Result'], axis=0).sum()
output:
Dummy1 4
Dummy2 2
dtype: int64
Intermediate:
>>> haves.drop('Result', axis=1).mul(haves['Result'], axis=0)
Dummy1 Dummy2
0 0 1
1 0 1
2 2 0
3 2 0
Shorter variant
Warning: this mutates the original dataframe, which will lose the 'Result' column.
wants = haves.mul(haves.pop('Result'), axis=0).sum()
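If mutating haves is a concern, selecting the dummy columns explicitly keeps the original frame intact (a sketch that assumes the dummy columns all start with 'Dummy'):
dummy_cols = haves.filter(like='Dummy').columns
wants = haves[dummy_cols].mul(haves['Result'], axis=0).sum()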

How to detect if a customer was already in the dataframe?

I have the following dataset that contains information on whether a customer gave a recommendation or not:
data = {'customer_id': [1, 2, 1, 3], 'recommend': [0, 1, 1, 0]}
df = pd.DataFrame.from_dict(data)
I would like to know if a customer gave a 0 recommendation in the past. The desired output would be:
data = {'customer_id': [1, 2, 1, 3], 'recommend': [0, 1, 1, 0], 'past': [0, 0, 1, 0]}
df = pd.DataFrame.from_dict(data)
How can I do it please?
You can do it by first adding a temporary column with your condition (recommend == 0), then using groupby together with shift and cummax inside transform to obtain the wanted past column. Finally, drop the temporary column.
Code:
df['equal_zero'] = (df['recommend'] == 0).astype(int)
df['past'] = (df.groupby('customer_id')['equal_zero']
                .transform(lambda x: x.shift(1).cummax())
                .fillna(0))
df = df.drop(columns=['equal_zero'])
Result:
customer_id recommend past
0 1 0 0.0
1 2 1 0.0
2 1 1 1.0
3 3 0 0.0
Use a custom function per group, with shift and cumulative max, in GroupBy.transform:
df['past'] = (df['recommend'].eq(0)
                .groupby(df['customer_id'])
                .transform(lambda x: x.shift(fill_value=False).cummax())
                .astype(int))
print(df)
customer_id recommend past
0 1 0 0
1 2 1 0
2 1 1 1
3 3 0 0
Assuming 'past' is a boolean (1 if the customer gave a zero in the past, 0 otherwise), here is a one-line solution:
df['past'] = df.apply(lambda x: 1 if len(df[(df.customer_id == x.customer_id) & (df.index < x.name) & (df.recommend == 0)]) > 0 else 0, axis=1)
If 'past' is a count value instead:
df['past'] = df.apply(lambda x: len(df[(df.customer_id == x.customer_id) & (df.index < x.name) & (df.recommend == 0)]), axis=1)
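For the count version, the per-row apply above rescans the whole frame for every row; a grouped cumulative sum gives the same counts vectorized (a sketch, assuming the rows are already in chronological order):
df['past'] = (df['recommend'].eq(0)
                .groupby(df['customer_id'])
                .transform(lambda s: s.shift(fill_value=False).cumsum()))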

Replace values by result of a function

I have the following dataframe:
df = pd.DataFrame({'A': [0, 1, 0],
                   'B': [1, 1, 1]},
                  index=['2020-01-01', '2020-02-01', '2020-03-01'])
I'm trying to replace every value of 1 with an increasing number. I'm looking for something like:
df.replace(1, value=3)
which works great, except that instead of the fixed number 3 I need the number to increase each time (I want to use it as an ID), as in:
number += 1
If I join those together it doesn't work (or at least I can't find the correct syntax). I'd like to obtain the following result:
df = pd.DataFrame({'A': [0, 2, 0],
                   'B': [1, 3, 4]},
                  index=['2020-01-01', '2020-02-01', '2020-03-01'])
Note: I cannot use any command that relies on specifying column or row names, because the table has 2600 columns and 5000 rows.
Element-wise assignment on the array returned by df.values can work.
More specifically, a range from 1 to the number of 1s (inclusive) is assigned onto the locations of the 1 elements in the array, and the array is then written back into the original dataframe.
Code
(Data as given)
1. Row-first ordering (what the OP wants)
arr = df.values
mask = (arr > 0)
arr[mask] = range(1, mask.sum() + 1)
for i, col in enumerate(df.columns):
    df[col] = arr[:, i]
# Result
print(df)
A B
2020-01-01 0 1
2020-02-01 2 3
2020-03-01 0 4
2. Column-first ordering (another possibility)
arr_tr = df.values.transpose()
mask_tr = (arr_tr > 0)
arr_tr[mask_tr] = range(1, mask_tr.sum() + 1)
for i, col in enumerate(df.columns):
    df[col] = arr_tr[i, :]
# Result
print(df)
A B
2020-01-01 0 2
2020-02-01 1 3
2020-03-01 0 4
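With 2600 columns, the column-by-column write-back can also be replaced by a single vectorized assignment. A compact sketch of the row-first version, assuming the frame only contains 0s and 1s:
mask = df.to_numpy() > 0
# cumsum over the flattened (row-major) mask numbers the 1s in row-first order
df[:] = np.where(mask, mask.cumsum().reshape(mask.shape), 0)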

compare columns row-wise in dataframe

I have a pandas dataframe; a sample is below:
df = a1 a2 a3 a4 a5
     0  1  1  1  0    # dict[a3_a4] = 1, dict[a2_a4] = 1, dict[a2_a3] = 1
     1  1  1  0  0    # dict[a1_a2] = 1, dict[a1_a3] = 1, dict[a2_a3] = 1
I need a function that takes the dataframe as input and returns how many times each pair of columns is 1 together, stored in a dictionary.
So my output will look like this: {'a1_a2': 1, 'a2_a3': 2, 'a3_a4': 1, 'a1_a3': 1, 'a2_a4': 1}
Pseudocode is fine if needed.
PS: I am new to Stack Overflow, so forgive me for my mistakes.
You can use itertools.combinations to get all the pairs of columns, then multiply the paired columns and sum the results.
from itertools import combinations
cc = list(combinations(df.columns,2))
df1 = pd.concat([df[c[1]]*df[c[0]] for c in cc], axis=1, keys=cc)
df1.columns = df1.columns.map('_'.join)
d = df1.sum().to_dict()
print(d)
Output:
{'a1_a2': 1,
'a1_a3': 1,
'a1_a4': 0,
'a1_a5': 0,
'a2_a3': 2,
'a2_a4': 1,
'a2_a5': 0,
'a3_a4': 1,
'a3_a5': 0,
'a4_a5': 0}
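The same counts can also be collected directly into a dict with a comprehension over the column pairs (a sketch, assuming the frame only holds 0/1 values):
from itertools import combinations

d = {f'{a}_{b}': int((df[a] * df[b]).sum()) for a, b in combinations(df.columns, 2)}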

Restructuring CSV into Pandas DataFrame

I've got a CSV with a rather messy format:
t, 01_x, 01_y, 02_x, 02_y
0, 0, 1, ,
1, 1, 1, 0, 0
Thereby "01_" and "02_" are numbers of entities (1, 2), which can vary from file to file and there might be additional columns too (but at least the same for all entities).
Note also that entity 2 enters the scene at t=1 (no entries at t=0).
I already import the CSV into a pandas dataframe, but I don't see how to transform it into the following form:
t, entity, x, y
0, 1, 0, 1
1, 1, 1, 1
1, 2, 0, 0
Is there a simple (pythonic) way to transform that?
Thanks!
René
This is a job for wide_to_long, but we first need to swap the order of your column names around the '_':
df.columns = ['_'.join(x.split('_')[::-1]) for x in df.columns]
#Index(['t', 'x_01', 'y_01', 'x_02', 'y_02'], dtype='object')
(pd.wide_to_long(df, i='t', j='entity', stubnames=['x', 'y'], sep='_')
.dropna()
.reset_index())
t entity x y
0 0 1 0.0 1.0
1 1 1 1.0 1.0
2 1 2 0.0 0.0
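Putting it together from the raw file, the spaces after the commas can be stripped at read time with skipinitialspace (a sketch; data.csv is a hypothetical filename):
import pandas as pd

df = pd.read_csv('data.csv', skipinitialspace=True)  # 'data.csv' assumed; strips the space after each comma
df.columns = ['_'.join(x.split('_')[::-1]) for x in df.columns]
tidy = (pd.wide_to_long(df, i='t', j='entity', stubnames=['x', 'y'], sep='_')
          .dropna()
          .reset_index())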
