I'm wondering how to efficiently loop through rows by groups. As the following sample dataset shows, it includes 3 different students with their pass records over 3 months.
import pandas as pd
import numpy as np
df = pd.DataFrame({'student': 'A A A B B B C C C'.split(),
                   'month': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'pass': [0, 1, 0, 0, 0, 0, 1, 0, 0]})
print(df)
student month pass
0 A 1 0
1 A 2 1
2 A 3 0
3 B 1 0
4 B 2 0
5 B 3 0
6 C 1 1
7 C 2 0
8 C 3 0
I'd like to have a new column "pass_patch", which should initially equal "pass". But once a student has "pass" equal to 1, all of his "pass_patch" values in the following months should be 1, like the following:
df = pd.DataFrame({'student': 'A A A B B B C C C'.split(),
                   'month': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'pass': [0, 1, 0, 0, 0, 0, 1, 0, 0],
                   'pass_patch': [0, 1, 1, 0, 0, 0, 1, 1, 1]})
print(df)
student month pass pass_patch
0 A 1 0 0
1 A 2 1 1
2 A 3 0 1
3 B 1 0 0
4 B 2 0 0
5 B 3 0 0
6 C 1 1 1
7 C 2 0 1
8 C 3 0 1
I did some searching and found iterrows might work, but I was concerned it would be too slow to run on the whole dataset (around a million records). Would there be a more efficient way to achieve this?
Any suggestions would be greatly appreciated.
Try with cummax
df['new'] = df.groupby('student')['pass'].cummax()
df
Out[78]:
student month pass new
0 A 1 0 0
1 A 2 1 1
2 A 3 0 1
3 B 1 0 0
4 B 2 0 0
5 B 3 0 0
6 C 1 1 1
7 C 2 0 1
8 C 3 0 1
What's the most efficient way to iterate by rows for each group of rows
DON'T ITERATE MANUALLY
Manual iteration should always be your last option; most of the time there is a better way to perform the required operation than iterating.
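For reference, the iterrows version the question asks about might look like the sketch below (shown only for comparison; it loops in Python over every row, which is exactly what makes it slow on a million records):

```python
import pandas as pd

df = pd.DataFrame({'student': 'A A A B B B C C C'.split(),
                   'month': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'pass': [0, 1, 0, 0, 0, 0, 1, 0, 0]})

# Manual iteration: remember whether each student has already passed.
seen = {}
patch = []
for _, row in df.iterrows():
    seen[row['student']] = seen.get(row['student'], 0) or row['pass']
    patch.append(seen[row['student']])
df['pass_patch'] = patch
```

This gives the same pass_patch column as the vectorized answers, but the per-row Python overhead dominates on large frames.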
You can groupby student, then call cumsum, which will sum the values cumulatively; convert the result to boolean, then back to int:
df['pass_patch'] = df.groupby('student')['pass'].cumsum().astype(bool).astype(int)
OUTPUT:
student month pass pass_patch
0 A 1 0 0
1 A 2 1 1
2 A 3 0 1
3 B 1 0 0
4 B 2 0 0
5 B 3 0 0
6 C 1 1 1
7 C 2 0 1
8 C 3 0 1
PS: In the above solution, you can skip the .astype(bool).astype(int) part if there is at most one 1 in pass per group. You may also need to sort the dataframe on month for each student if it isn't sorted already; I have not added that step since the sample data you provided is already in that order.
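The sorting step mentioned above could be sketched like this (assuming the same column names as the sample data):

```python
import pandas as pd

# Deliberately unsorted months for one student.
df = pd.DataFrame({'student': ['A', 'A', 'A'],
                   'month': [3, 1, 2],
                   'pass': [0, 0, 1]})

# Put each student's records in chronological order before the cumulative step.
df = df.sort_values(['student', 'month']).reset_index(drop=True)
df['pass_patch'] = df.groupby('student')['pass'].cumsum().astype(bool).astype(int)
```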
import pandas as pd
import numpy as np
df = pd.DataFrame({'student': 'A A A B B B C C C'.split(),
                   'month': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'pass': [0, 1, 0, 0, 0, 0, 1, 0, 0]})
First we search for the month of the first pass, for the students that actually passed at least once.
grp = df[df["pass"].eq(1)]\
    .sort_values(["student", "month"])\
    .groupby("student").head(1)
where grp looks like
student month pass
1 A 2 1
6 C 1 1
Then we merge the dataframes
df = pd.merge(df,
              grp,
              on=["student"],
              how="left",
              suffixes=(None, '_y'))
and df looks like
student month pass month_y pass_y
0 A 1 0 2.0 1.0
1 A 2 1 2.0 1.0
2 A 3 0 2.0 1.0
3 B 1 0 NaN NaN
4 B 2 0 NaN NaN
5 B 3 0 NaN NaN
6 C 1 1 1.0 1.0
7 C 2 0 1.0 1.0
8 C 3 0 1.0 1.0
Finally we set 1 for all months greater than or equal to month_y and 0 otherwise.
df["pass_patch"] = np.where(df["month"].ge(df["month_y"]), 1, 0)
and we drop the columns we don't need anymore
df = df.drop(columns=["month_y", "pass_y"])
Which returns
student month pass pass_patch
0 A 1 0 0
1 A 2 1 1
2 A 3 0 1
3 B 1 0 0
4 B 2 0 0
5 B 3 0 0
6 C 1 1 1
7 C 2 0 1
8 C 3 0 1
You can replace 0 with pd.NA, then use the ffill method, and then replace the null values back to 0:
df['pass_patch'] = df['pass'].replace(0, pd.NA)
df['pass_patch'] = df.groupby('student')['pass_patch']\
    .transform(lambda x: x.ffill())\
    .fillna(0)\
    .astype(int)
Output:
student month pass pass_patch
0 A 1 0 0
1 A 2 1 1
2 A 3 0 1
3 B 1 0 0
4 B 2 0 0
5 B 3 0 0
6 C 1 1 1
7 C 2 0 1
8 C 3 0 1
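As a side note, the lambda isn't strictly required here: groupby objects expose ffill directly, so the same idea can be written a bit more compactly (a sketch of the same approach, not a different algorithm):

```python
import pandas as pd

df = pd.DataFrame({'student': 'A A A B B B C C C'.split(),
                   'month': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                   'pass': [0, 1, 0, 0, 0, 0, 1, 0, 0]})

# Replace 0 with NA, forward-fill within each student, then restore the 0s.
df['pass_patch'] = (df['pass'].replace(0, pd.NA)
                      .groupby(df['student'])
                      .ffill()
                      .fillna(0)
                      .astype(int))
```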
I need to subtract two DataFrames with different indexes (which produces NaN values where one of the values is missing), and I want to replace the missing values from each DataFrame with a different number (fill value).
For example, let's say I have df1 and df2:
df1:
A B C
0 0 3 0
1 0 0 4
2 4 0 2
df2:
A B C
0 0 3 0
1 1 2 0
3 1 2 0
subtracted = df1.sub(df2):
A B C
0 0 0 0
1 -1 -2 4
2 NaN NaN NaN
3 NaN NaN NaN
I want row 2 of subtracted to take the values from row 2 of df1, and row 3 of subtracted to be filled with the value 5.
I expect -
subtracted:
A B C
0 0 0 0
1 -1 -2 4
2 4 0 2
3 5 5 5
I tried using the method sub with fill_value=5, but then in both rows 2 and 3 I'll get 0.
One way would be to reindex df2 over the union of both indexes with fill_value=0, then subtract and fillna with 5:
ix = pd.RangeIndex(df1.index.union(df2.index).max() + 1)
df1.sub(df2.reindex(ix, fill_value=0)).fillna(5).astype(df1.dtypes)
A B C
0 0 0 0
1 -1 -2 4
2 4 0 2
3 5 5 5
We have to reindex here to get aligned indices. This way we can use the sub method.
import numpy as np
idxmin = df2.index.min()
idxmax = df2.index.max()
idx = np.arange(idxmin, idxmax + 1)
df1.reindex(idx).sub(df2.reindex(idx).fillna(0)).fillna(5)
A B C
0 0.0 0.0 0.0
1 -1.0 -2.0 4.0
2 4.0 0.0 2.0
3 5.0 5.0 5.0
I found the combine_first method that almost satisfies my needs:
df2.combine_first(df1).sub(df2, fill_value=0)
but still produces only:
A B C
0 0 0 0
1 0 0 0
2 4 0 2
3 0 0 0
I have a df that is similar to the following:
value is_1 is_2 is_3
5 0 1 0
7 0 0 1
4 1 0 0
(it is guaranteed that the sum of the values in columns is_1 ... is_n is equal to 1 in each row)
I need to get the following result:
is_1 is_2 is_3
0 5 0
0 0 7
4 0 0
(I should find the column is_k that is greater than 0 and fill it with the value from the "value" column)
What is the best way to achieve it?
I'd do it this way:
In [16]: df = df.mul(df.pop('value').values, axis=0)
In [17]: df
Out[17]:
is_1 is_2 is_3
0 0 5 0
1 0 0 7
2 4 0 0
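Note that df.pop removes the value column from df in place. If you need to keep the original frame intact, a non-destructive variant could look like this sketch (it assumes the indicator columns all share the is_ prefix, as in the sample data):

```python
import pandas as pd

df = pd.DataFrame({'value': [5, 7, 4],
                   'is_1': [0, 0, 1],
                   'is_2': [1, 0, 0],
                   'is_3': [0, 1, 0]})

# Select only the indicator columns and multiply each row by its 'value';
# df itself is left unchanged.
result = df.filter(like='is_').mul(df['value'], axis=0)
```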
I want to apply cumsum on a dataframe in pandas in Python, but without the zeros. Simply put, I want to leave the zeros in place and do the cumsum on the rest of the dataframe. Suppose I have a dataframe like this:
import pandas as pd
df = pd.DataFrame({'a': [1, 2, 0, 1],
                   'b': [2, 5, 0, 0],
                   'c': [0, 1, 2, 5]})
a b c
0 1 2 0
1 2 5 1
2 0 0 2
3 1 0 5
and the result should be
a b c
0 1 2 0
1 3 7 1
2 0 0 3
3 4 0 8
Any ideas how to do that while avoiding loops? In R there is the ave function, but I'm very new to Python and don't know the equivalent.
You can mask the df so that you only overwrite the non-zero cells:
In [173]:
df[df!=0] = df.cumsum()
df
Out[173]:
a b c
0 1 2 0
1 3 7 1
2 0 0 3
3 4 0 8
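An equivalent way to express the same masking, if you prefer not to assign through a boolean mask, is DataFrame.where: keep the running sum where the original cell was non-zero, otherwise use 0 (just another way to write the answer above):

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 0, 1],
                   'b': [2, 5, 0, 0],
                   'c': [0, 1, 2, 5]})

# cumsum over the whole frame, then zero out the cells that were 0 originally.
result = df.cumsum().where(df != 0, 0)
```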
I am trying to convert a data set with 100,000 rows and 3 columns into a pivot table. While the following code runs without an error, the values are displayed as NaN.
df1 = pd.pivot_table(df_TEST, values='actions', index=['sku'], columns=['user'])
It is not taking the values (which range from 1 to 36) from the DataFrame. Has anyone come across this situation?
This can happen when you are doing a pivot, since not all combinations of values might be present, e.g.:
In [10]: df_TEST
Out[10]:
a b c
0 0 0 0
1 0 1 0
2 0 2 0
3 1 1 1
4 1 2 3
5 1 4 5
Now, when you do pivot on this,
In [9]: df_TEST.pivot_table(index='a', values='c', columns='b')
Out[9]:
b      0    1    2    4
a
0      0    0    0  NaN
1    NaN    1    3    5
Note that you got NaN at index 0, column 4, since there is no entry in df_TEST with column a = 0 and column b = 4.
Typically you fill such values with zeros.
In [11]: df_TEST.pivot_table(index='a', values='c', columns='b').fillna(0)
Out[11]:
b  0  1  2  4
a
0  0  0  0  0
1  0  1  3  5
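pivot_table also accepts a fill_value argument, so the fillna step can be folded in directly:

```python
import pandas as pd

# Same sample data as above: combinations (a=0, b=4) and (a=1, b=0) are missing.
df_TEST = pd.DataFrame({'a': [0, 0, 0, 1, 1, 1],
                        'b': [0, 1, 2, 1, 2, 4],
                        'c': [0, 0, 0, 1, 3, 5]})

# fill_value replaces the NaNs created for missing (a, b) combinations.
pivoted = df_TEST.pivot_table(index='a', values='c', columns='b', fill_value=0)
```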