I have a pandas data frame:
df12 = pd.DataFrame({'group_ids':[1,1,1,2,2,2],'dates':['2016-04-01','2016-04-20','2016-04-28','2016-04-05','2016-04-20','2016-04-29'],'event_today_in_group':[1,0,1,1,1,0]})
group_ids dates event_today_in_group
0 1 2016-04-01 1
1 1 2016-04-20 0
2 1 2016-04-28 1
3 2 2016-04-05 1
4 2 2016-04-20 1
5 2 2016-04-29 0
I would like to compute an additional column that contains, for each group_ids, the number of days since the last time event_today_in_group was 1.
group_ids dates event_today_in_group days_since_last_event
0 1 2016-04-01 1 0
1 1 2016-04-20 0 19
2 1 2016-04-28 1 27
3 2 2016-04-05 1 0
4 2 2016-04-20 1 15
5 2 2016-04-29 0 9
As I mentioned earlier, this will get you the non-cumulative difference between dates within each group:
df['days_since_last_event'] = df.groupby('group_ids')['dates'].diff().apply(lambda x: x.days)
In order to get a cumulative sum of this difference, based on whenever event_today_in_group changes, I propose using shift to get the value of the previous row, and then generating a cumulative sum, like so:
df['event_today_in_group'].shift().cumsum()
Output:
0 NaN
1 1.0
2 1.0
3 2.0
4 3.0
5 4.0
This gives us the second grouping value we need to get the cumulative sums. You could assign the above values to a new column, but if you're only using them for the calculation, then you can simply include them in the subsequent groupby operation like so:
df.loc[:, 'days_since_last_event'] = df.groupby(['group_ids', df['event_today_in_group'].shift().cumsum()])['days_since_last_event'].cumsum()
Result:
group_ids dates event_today_in_group days_since_last_event
0 1 2016-04-01 1 NaN
1 1 2016-04-20 0 19.0
2 1 2016-04-28 1 27.0
3 2 2016-04-05 1 NaN
4 2 2016-04-20 1 15.0
5 2 2016-04-29 0 9.0
Related
I have this dataset, which contains some NaN values:
df = pd.DataFrame({'Id':[1,2,3,4,5,6], 'Name':['Eve','Diana',np.NaN,'Mia','Mae',np.NaN], "Count":[10,3,np.NaN,8,5,2]})
df
Id Name Count
0 1 Eve 10.0
1 2 Diana 3.0
2 3 NaN NaN
3 4 Mia 8.0
4 5 Mae 5.0
5 6 NaN 2.0
I want to test if the column has a NaN value (0) or not (1) and creating two new columns. I have tried this:
df_clean = df
df_clean[['Name_flag','Count_flag']] = df_clean[['Name','Count']].apply(lambda x: 0 if x == np.NaN else 1, axis = 1)
But it mentions that The truth value of a Series is ambiguous. I want to make it avoiding redundancy, but I see there is a mistake in my logic. Please, could you help me with this question?
The expected table is:
Id Name Count Name_flag Count_flag
0 1 Eve 10.0 1 1
1 2 Diana 3.0 1 1
2 3 NaN NaN 0 0
3 4 Mia 8.0 1 1
4 5 Mae 5.0 1 1
5 6 NaN 2.0 0 1
Multiply boolean mask by 1:
df[['Name_flag','Count_flag']] = df[['Name', 'Count']].isna() * 1
>>> df
Id Name Count Name_flag Count_flag
0 1 Eve 10.0 0 0
1 2 Diana 3.0 0 0
2 3 NaN NaN 1 1
3 4 Mia 8.0 0 0
4 5 Mae 5.0 0 0
5 6 NaN 2.0 1 0
For your problem of The truth value of a Series is ambiguous
For apply, you cannot return a scalar 0 or 1 because you have a series as input . You have to use applymap instead to apply a function elementwise. But comparing to NaN is not an easy thing:
Try:
df[['Name','Count']].applymap(lambda x: str(x) == 'nan') * 1
We can use isna and convert the boolean to int:
df[["Name_flag", "Count_flag"]] = df[["Name", "Count"]].isna().astype(int)
Id Name Count Name_flag Count_flag
0 1 Eve 10.00 0 0
1 2 Diana 3.00 0 0
2 3 NaN NaN 1 1
3 4 Mia 8.00 0 0
4 5 Mae 5.00 0 0
5 6 NaN 2.00 1 0
I am trying to impute/fill values using rows with similar columns' values.
For example, I have this dataframe:
one | two | three
1 1 10
1 1 nan
1 1 nan
1 2 nan
1 2 20
1 2 nan
1 3 nan
1 3 nan
I wanted to using the keys of column one and two which is similar and if column three is not entirely nan then impute the existing value from a row of similar keys with value in column '3'.
Here is my desired result:
one | two | three
1 1 10
1 1 10
1 1 10
1 2 20
1 2 20
1 2 20
1 3 nan
1 3 nan
You can see that keys 1 and 3 do not contain any value because the existing value does not exists.
I have tried using groupby+fillna():
df['three'] = df.groupby(['one','two'])['three'].fillna()
which gave me an error.
I have tried forward fill which give me rather strange result where it forward fill the column 2 instead. I am using this code for forward fill.
df['three'] = df.groupby(['one','two'], sort=False)['three'].ffill()
If only one non NaN value per group use ffill (forward filling) and bfill (backward filling) per group, so need apply with lambda:
df['three'] = df.groupby(['one','two'], sort=False)['three']
.apply(lambda x: x.ffill().bfill())
print (df)
one two three
0 1 1 10.0
1 1 1 10.0
2 1 1 10.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
But if multiple value per group and need replace NaN by some constant - e.g. mean by group:
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 NaN
3 1 2 NaN
4 1 2 20.0
5 1 2 NaN
6 1 3 NaN
7 1 3 NaN
df['three'] = df.groupby(['one','two'], sort=False)['three']
.apply(lambda x: x.fillna(x.mean()))
print (df)
one two three
0 1 1 10.0
1 1 1 40.0
2 1 1 25.0
3 1 2 20.0
4 1 2 20.0
5 1 2 20.0
6 1 3 NaN
7 1 3 NaN
You can sort data by the column with missing values then groupby and forwardfill:
df.sort_values('three', inplace=True)
df['three'] = df.groupby(['one','two'])['three'].ffill()
I am working on a data set with the following columns:
order_id
order_item_id
product mrp
units
sale_date
I want to create a new column which shows how much the mrp changed from the last time this product was. This there a way I can do this with pandas data frame?
Sorry if this question is very basic but I am pretty new to pandas.
Sample data:
expected data:
For each row of the data I want to check the amount of price change for the last time the product was sold.
You can do this as follows:
# define a function that applies rolling window calculationg
# taking the difference between the last value and the current
# value
def calc_mrp(ser):
# in case you want the relative change, just
# divide by x[1] or x[0] in the lambda function
return ser.rolling(window=2).apply(lambda x: x[1]-x[0])
# apply this to the grouped 'product_mrp' column
# and store the result in a new column
df['mrp_change']=df.groupby('product_id')['product_mrp'].apply(calc_mrp)
If this is executed on a dataframe like:
Out[398]:
order_id product_id product_mrp units_sold sale_date
0 0 2 647.169280 8 2019-08-23
1 1 0 500.641188 0 2019-08-24
2 2 1 647.789399 15 2019-08-25
3 3 0 381.278167 12 2019-08-26
4 4 2 373.685000 7 2019-08-27
5 5 4 553.472850 2 2019-08-28
6 6 4 634.482718 7 2019-08-29
7 7 3 536.760482 11 2019-08-30
8 8 0 690.242274 6 2019-08-31
9 9 4 500.515521 0 2019-09-01
It yields:
Out[400]:
order_id product_id product_mrp units_sold sale_date mrp_change
0 0 2 647.169280 8 2019-08-23 NaN
1 1 0 500.641188 0 2019-08-24 NaN
2 2 1 647.789399 15 2019-08-25 NaN
3 3 0 381.278167 12 2019-08-26 -119.363022
4 4 2 373.685000 7 2019-08-27 -273.484280
5 5 4 553.472850 2 2019-08-28 NaN
6 6 4 634.482718 7 2019-08-29 81.009868
7 7 3 536.760482 11 2019-08-30 NaN
8 8 0 690.242274 6 2019-08-31 308.964107
9 9 4 500.515521 0 2019-09-01 -133.967197
The NaNs are in the rows, for which there is not previous order with the same product_id.
I have the following dataframe df:
bucket_value is_new_bucket
dates
2019-03-07 0 1
2019-03-08 1 0
2019-03-09 2 0
2019-03-10 3 0
2019-03-11 4 0
2019-03-12 5 1
2019-03-13 6 0
2019-03-14 7 1
I want to apply a specific function (let’s say the mean function) to each bucket_value data groups where the column is_new_bucket is equal to zero, such that the resulting dataframe would look like this:
mean_values
dates
2019-03-08 2.5
2019-03-13 6.0
In other words, applying a function to the consecutive rows where is_new_bucket = 0, which takes the bucket_value as input.
For instance, if I want to apply the max function, the resulting dataframe would look like this:
max_values
dates
2019-03-11 4.0
2019-03-13 6.0
Using cumsum with filter
df.reset_index(inplace=True)
s=df.loc[df.is_new_bucket==0].groupby(df.is_new_bucket.cumsum()).agg({'date':'first','bucket_value':['mean','max']})
s
date bucket_value
first mean max
is_new_bucket
1 2019-03-08 2.5 4
2 2019-03-13 6.0 6
Updated
df.loc[df.loc[df.is_new_bucket==0].groupby(df.is_new_bucket.cumsum())['bucket_value'].idxmax()]
date bucket_value is_new_bucket
4 2019-03-11 4 0
6 2019-03-13 6 0
Updated2 after using the cumsum create the group key Newkey , you can do whatever you need , base on the groupkey
df['Newkey']=df.is_new_bucket.cumsum()
df
date bucket_value is_new_bucket Newkey
0 2019-03-07 0 1 1
1 2019-03-08 1 0 1
2 2019-03-09 2 0 1
3 2019-03-10 3 0 1
4 2019-03-11 4 0 1
5 2019-03-12 5 1 2
6 2019-03-13 6 0 2
7 2019-03-14 7 1 3
I have a pandas DataFrame of the form:
df
ID_col time_in_hours data_col
1 62.5 4
1 40 3
1 20 3
2 30 1
2 20 5
3 50 6
What I want to be able to do is, find the rate of change of data_col by using the time_in_hours column. Specifically,
rate_of_change = (data_col[i+1] - data_col[i]) / abs(time_in_hours[ i +1] - time_in_hours[i])
Where i is a given row and the rate_of_change is calculated separately for different IDs
Effectively, I want a new DataFrame of the form:
new_df
ID_col time_in_hours data_col rate_of_change
1 62.5 4 NaN
1 40 3 -0.044
1 20 3 0
2 30 1 NaN
2 20 5 0.4
3 50 6 NaN
How do I go about this?
You can use groupby:
s = df.groupby('ID_col').apply(lambda dft: dft['data_col'].diff() / dft['time_in_hours'].diff().abs())
s.index = s.index.droplevel()
s
returns
0 NaN
1 -0.044444
2 0.000000
3 NaN
4 0.400000
5 NaN
dtype: float64
You can actually get around the groupby + apply given how your DataFrame is sorted. In this case, you can just check if the ID_col is the same as the shifted row.
So calculate the rate of change for everything, and then only assign the values back if they are within a group.
import numpy as np
mask = df.ID_col == df.ID_col.shift(1)
roc = (df.data_col - df.data_col.shift(1))/np.abs(df.time_in_hours - df.time_in_hours.shift(1))
df.loc[mask, 'rate_of_change'] = roc[mask]
Output:
ID_col time_in_hours data_col rate_of_change
0 1 62.5 4 NaN
1 1 40.0 3 -0.044444
2 1 20.0 3 0.000000
3 2 30.0 1 NaN
4 2 20.0 5 0.400000
5 3 50.0 6 NaN
You can use pandas.diff:
df.groupby('ID_col').apply(
lambda x: x['data_col'].diff() / x['time_in_hours'].diff().abs())
ID_col
1 0 NaN
1 -0.044444
2 0.000000
2 3 NaN
4 0.400000
3 5 NaN
dtype: float64