Summing rows based on one-hot variables - Python

I think the code below works, but it seems too clumsy. Basically, I want to go from the haves dataframe below to a per-dummy total: adding up the Result column for each dummy column wherever that dummy is 1. Hope this makes sense?
import pandas as pd

data = {'Dummy1': [0, 0, 1, 1],
        'Dummy2': [1, 1, 0, 0],
        'Result': [1, 1, 2, 2]}
haves = pd.DataFrame(data)
print(haves)

melted = pd.melt(haves, id_vars=['Result'])
melted = melted.loc[melted['value'] > 0]
print(melted)

wants = melted.groupby(['variable'])['Result'].sum()
print(wants)

No need to melt; perform a simple multiplication and sum:
wants = haves.drop('Result', axis=1).mul(haves['Result'], axis=0).sum()
output:
Dummy1    4
Dummy2    2
dtype: int64
Intermediate:
>>> haves.drop('Result', axis=1).mul(haves['Result'], axis=0)
   Dummy1  Dummy2
0       0       1
1       0       1
2       2       0
3       2       0
Shorter variant
Warning: this mutates the original dataframe, which will lose the 'Result' column.
wants = haves.mul(haves.pop('Result'), axis=0).sum()
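Equivalently (a small sketch of the same idea, assuming 'Result' has not been popped yet): since the dummies are 0/1, the totals are just a matrix-vector product:
wants = haves[['Dummy1', 'Dummy2']].T.dot(haves['Result'])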


Add multiple columns programmatically from individual criteria/rules

I would like to add multiple columns programmatically to a dataframe using pre-defined rules. As an example, I would like to add 3 columns to the dataframe below, based on whether or not they satisfy the three rules indicated in the code below:
import numpy as np
import pandas as pd

# define dataframe
df1 = pd.DataFrame({"time1": [0, 1, 1, 0, 0],
                    "time2": [1, 0, 0, 0, 1],
                    "time3": [0, 0, 0, 1, 0],
                    "outcome": [1, 0, 0, 1, 0]})

# define "rules" for adding subsequent columns
rule_1 = (df1["time1"] == 1)
rule_2 = (df1["time2"] == 1)
rule_3 = (df1["time3"] == 1)

# add new columns based on whether or not the above rules are satisfied
df1["rule_1"] = np.where(rule_1, 1, 0)
df1["rule_2"] = np.where(rule_2, 1, 0)
df1["rule_3"] = np.where(rule_3, 1, 0)
As you can see, my approach gets tedious when I need to add tens of columns, each based on a different "rule", to a test dataframe.
Is there a way to do this more easily without defining each column manually along with its individual np.where clause? I tried doing something like this, but pandas does not accept it:
rules = [rule_1, rule_2, rule_3]
for rule in rules:
    df1[rule] = np.where(rule, 1, 0)
Any ideas on how to make my approach more programmatically efficient?
The solution you tried doesn't work because you are using the rule itself (a boolean Series) as the new column name. I would solve it like this:
rules = [rule_1, rule_2, rule_3]
for i, rule in enumerate(rules):
    df1[f'rule_{i+1}'] = np.where(rule, 1, 0)
Leverage Python's f-strings in a for loop; they are a good fit for this.
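A related sketch (assuming you are happy to name each rule up front): keep the rules in a dict so each new column name travels with its condition:
rules = {
    'rule_1': df1['time1'] == 1,
    'rule_2': df1['time2'] == 1,
    'rule_3': df1['time3'] == 1,
}
for name, rule in rules.items():
    df1[name] = rule.astype(int)  # boolean Series -> 0/1 column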
# Create a list by filtering the time columns
cols = list(df1.filter(regex='time', axis=1).columns)

# Iterate through the list of columns, imposing your conditions with np.where
for col in cols:
    df1[f'{col}_new'] = df1[col].apply(lambda x: np.where(x == 1, 1, 0))
I might be oversimplifying your rules, but something like:
rules = [
    ('time1', 1),
    ('time2', 1),
    ('time3', 1),
]
for i, (col, val) in enumerate(rules):
    df1[f"rule_{i + 1}"] = np.where(df1[col] == val, 1, 0)
If all of your rules check the same thing, maybe this could be helpful: unstack the relevant columns, check the condition on the resulting Series, and convert back to a DataFrame with unstack:
df1[['rule1','rule2','rule3']] = df1[['time1','time2','time3']].unstack().eq(1).astype(int).swaplevel().unstack()
Output:
   time1  time2  time3  outcome  rule1  rule2  rule3
0      0      1      0        1      0      1      0
1      1      0      0        0      1      0      0
2      1      0      0        0      1      0      0
3      0      0      1        1      0      0      1
4      0      1      0        0      0      1      0
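If every rule really is the same == 1 comparison, an even shorter sketch (same column names as above) compares all the time columns at once and writes the 0/1 block back in a single assignment:
time_cols = ['time1', 'time2', 'time3']
df1[['rule1', 'rule2', 'rule3']] = (df1[time_cols] == 1).astype(int).to_numpy()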

Replace values by result of a function

I have the following dataframe:
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 0],
                   'B': [1, 1, 1]},
                  index=['2020-01-01', '2020-02-01', '2020-03-01'])
I'm trying to replace every value where a 1 is present with an increasing number. I'm looking for something like:
df.replace(1, value=3)
which works great, but instead of the constant 3 I need the number to increase each time (I want to use it as an ID), as in:
number += 1
If I try to combine the two, it doesn't work (or at least I can't find the correct syntax). I'd like to obtain the following result:
df = pd.DataFrame({'A': [0, 2, 0],
                   'B': [1, 3, 4]},
                  index=['2020-01-01', '2020-02-01', '2020-03-01'])
Note: I cannot use any command that relies on specific column or row names, because the table has 2,600 columns and 5,000 rows.
Element-wise assignment on the underlying value array can work.
More specifically, a range from 1 up to the number of 1s (inclusive) is assigned to the locations of the 1 elements in the value array, and the array is then written back into the original dataframe.
Code
(Data as given)
1. Row-first ordering (what the OP wants)
arr = df.values
mask = (arr > 0)
arr[mask] = range(1, mask.sum() + 1)
for i, col in enumerate(df.columns):
    df[col] = arr[:, i]
# Result
print(df)

            A  B
2020-01-01  0  1
2020-02-01  2  3
2020-03-01  0  4
2. Column-first ordering (another possibility)
arr_tr = df.values.transpose()
mask_tr = (arr_tr > 0)
arr_tr[mask_tr] = range(1, mask_tr.sum() + 1)
for i, col in enumerate(df.columns):
    df[col] = arr_tr[i, :]
# Result
print(df)

            A  B
2020-01-01  0  2
2020-02-01  1  3
2020-03-01  0  4
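For the row-first case there is also a fully vectorised sketch with no write-back loop (starting again from the data as given): ndarray.cumsum() flattens in row-major order, so a cumulative sum of the mask produces exactly the running count 1..n where the 1s sit:
import numpy as np

mask = df.to_numpy() > 0
ids = mask.cumsum().reshape(df.shape)  # running count of True values, row by row
df[:] = np.where(mask, ids, 0)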

Find the minimum value of a column greater than another column value in Python Pandas

I'm working in Python. I have two dataframes df1 and df2:
import numpy as np
import pandas as pd

d1 = {'timestamp1': [88148, 5617900, 5622548, 5645748, 6603950, 6666502],
      'col01': [1, 2, 3, 4, 5, 6]}
df1 = pd.DataFrame(d1)

d2 = {'timestamp2': [5629500, 5643050, 6578800, 6583150, 6611350],
      'col02': [7, 8, 9, 10, 11],
      'col03': [0, 1, 0, 0, 1]}
df2 = pd.DataFrame(d2)
I want to create a new column in df1 with the value of the minimum timestamp of df2 greater than the current df1 timestamp, where df2['col03'] is zero. This is the way I did it:
df1['colnew'] = np.nan
TSs = df1['timestamp1']
for TS in TSs:
    values = df2['timestamp2'][(df2['timestamp2'] > TS) & (df2['col03'] == 0)]
    if not values.empty:
        df1.loc[df1['timestamp1'] == TS, 'colnew'] = values.iloc[0]
It works, but I'd prefer not to use a for loop. Is there a better way to do this?
Use pandas.merge_asof with a forward direction
pd.merge_asof(
    df1, df2.loc[df2.col03 == 0, ['timestamp2']],
    left_on='timestamp1', right_on='timestamp2', direction='forward'
).rename(columns=dict(timestamp2='colnew'))
   col01  timestamp1     colnew
0      1       88148  5629500.0
1      2     5617900  5629500.0
2      3     5622548  5629500.0
3      4     5645748  6578800.0
4      5     6603950        NaN
5      6     6666502        NaN
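One caveat: merge_asof requires both join keys to be sorted in ascending order (they already are in this example). If yours are not, sort first:
df1 = df1.sort_values('timestamp1')
df2 = df2.sort_values('timestamp2')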
Give the apply method a try.
def func(x):
    values = df2['timestamp2'][(df2['timestamp2'] > x) & (df2['col03'] == 0)]
    if not values.empty:
        return values.iloc[0]
    else:
        return np.nan

df1["timestamp1"].apply(func)
You can create a separate function that does what has to be done.
The output is your new column:
0    5629500.0
1    5629500.0
2    5629500.0
3    6578800.0
4          NaN
5          NaN
Name: timestamp1, dtype: float64
It is not a one-line solution, but it helps keep things organised.
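To keep the result, assign it back as the new column:
df1['colnew'] = df1['timestamp1'].apply(func)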

Creating Pivot DataFrame using Multiple Columns in Pandas

I have a pandas dataframe following the form in the example below:
import pandas as pd

data = {'id': [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3],
        'a': [-1, 1, 1, 0, 0, 0, -1, 1, -1, 0, 0],
        'b': [1, 0, 0, -1, 0, 1, 1, -1, -1, 1, 0]}
df = pd.DataFrame(data)
Now, what I want to do is create a pivot table such that for each of the columns except the id, I will have 3 new columns corresponding to the values. That is, for column a, I will create a_neg, a_zero and a_pos. Similarly, for b, I will create b_neg, b_zero and b_pos. The values for these new columns would correspond to the number of times those values appear in the original a and b column. The final dataframe should look like this:
result = {'id': [1, 2, 3],
          'a_neg': [1, 1, 1], 'a_zero': [1, 2, 2], 'a_pos': [2, 1, 0],
          'b_neg': [1, 1, 1], 'b_zero': [2, 1, 1], 'b_pos': [1, 2, 1]}
df_result = pd.DataFrame(result)
Now, to do this, I can do the following steps and arrive at my final answer:
by_a = df.groupby(['id', 'a']).count().reset_index().pivot(index='id', columns='a', values='b').fillna(0).astype(int)
by_a.columns = ['a_neg', 'a_zero', 'a_pos']
by_b = df.groupby(['id', 'b']).count().reset_index().pivot(index='id', columns='b', values='a').fillna(0).astype(int)
by_b.columns = ['b_neg', 'b_zero', 'b_pos']
df_result = by_a.join(by_b).reset_index()
However, I believe that method is not optimal, especially if I have many original columns aside from a and b. Is there a shorter and/or more efficient solution for achieving this? Thanks.
A shorter solution, though still quite inefficient:
In [11]: df1 = df.set_index("id")
In [12]: g = df1.groupby(level=0)
In [13]: g.apply(lambda x: x.apply(lambda x: x.value_counts())).fillna(0).astype(int).unstack(1)
Out[13]:
    a        b
   -1  0  1 -1  0  1
id
1   1  1  2  1  2  1
2   1  2  1  1  1  2
3   1  2  0  1  1  1
Note: I think you should be aiming for the multi-index columns.
I'm reasonably sure I've seen a trick to remove the apply/value_count/fillna with something cleaner and more efficient, but at the moment it eludes me...
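For what it's worth, one candidate for that cleaner trick (a sketch, one possibility rather than the definitive version) is to melt to long form and count everything in a single pd.crosstab, which lands directly on the multi-index columns suggested above:
long = df.melt(id_vars='id')  # columns: id, variable, value
counts = pd.crosstab([long['id'], long['variable']], long['value'])
result = counts.unstack('variable').swaplevel(axis=1).sort_index(axis=1)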

Shifting order of rows in Dataframe

I am trying to make the last two rows of my dataframe df its first two rows, with the previous first row becoming the third row after the shift. It's because I just added the rows [3, 0.3232, 0, 0, 2, 0.500] and [6, 0.3232, 0, 0, 2, 0.500]; however, these get appended to the end of df and hence become the last two rows, when I want them to be the first two. I was just wondering how to do this.
df = df.T
df[0] = [3,0.3232, 0, 0, 2,0.500]
df[1] = [6,0.3232, 0, 0, 2,0.500]
df = df.T
df = df.reset_index()
You can just call reindex and pass the new desired order:
In [14]:
df = pd.DataFrame({'a':['a','b','c']})
df
Out[14]:
   a
0  a
1  b
2  c
In [16]:
df.reindex([1,2,0])
Out[16]:
   a
1  b
2  c
0  a
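If you would rather not hard-code the labels, the same reindex order can be built from the existing index; a sketch for the "last two rows first" case:
df.reindex(df.index[-2:].append(df.index[:-2]))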
EDIT
Another method would be to use np.roll. Note that this returns a np.array, so we have to explicitly select the columns from the df to overwrite them:
In [30]:
df = pd.DataFrame({'a':['a','b','c'], 'b':np.arange(3)})
df
Out[30]:
   a  b
0  a  0
1  b  1
2  c  2
In [42]:
df[df.columns] = np.roll(df, shift=-1, axis=0)
df
Out[42]:
   a  b
0  b  1
1  c  2
2  a  0
The axis=0 param seems to be necessary; otherwise the column order is not preserved:
In [44]:
df[df.columns] = np.roll(df, shift=-1)
df
Out[44]:
   a  b
0  0  b
1  1  c
2  2  a
Unless I'm missing something, the easiest solution is just to add the new rows to the beginning in the first place:
existing_rows = pd.DataFrame(np.random.randn(4, 3))
new_rows = pd.DataFrame(np.random.randn(2, 3))
new_rows.append(existing_rows)
          0         1         2
0  0.406690 -0.699925  0.449278
1  1.729282  0.387896  0.652381
0  0.091711  1.634247  0.749282
1  1.354132 -0.180248 -1.880638
2 -0.151871 -1.266152  0.333071
3  1.351072 -0.421404 -0.951583
If you really want to switch rows you can do as EdChum suggests. Another way is like this:
df.iloc[-2:].append(df.iloc[:-2])
I think this is slightly simpler than np.roll as suggested by EdChum, but numpy is generally faster so I'd use np.roll if you care about speed. (And doing some quick tests on 1,000x3 data suggests it is about 3x to 4x faster than append.)
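A note for current pandas: DataFrame.append was removed in pandas 2.0, so the two append-based snippets above translate to pd.concat:
pd.concat([new_rows, existing_rows])     # instead of new_rows.append(existing_rows)
pd.concat([df.iloc[-2:], df.iloc[:-2]])  # instead of df.iloc[-2:].append(df.iloc[:-2])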
