Write column backwards with condition in Python

I have the following df and want to write the Number column backwards (from the oldest year up), overwriting other values if necessary. The condition is to always use the previous value unless the new value's difference from the old value is greater than 10%.
Date Number
2019 150
2018 NaN
2017 118
2016 NaN
2015 115
2014 107
2013 105
2012 NaN
2011 100
Because of the condition, the value in e.g. 2013 equals 100: the raw value 105 is not smaller than 90 and not greater than 110 (i.e. within 10% of the previous value 100), so the previous value is carried forward. The result would look like this:
Date Number
2019 150
2018 115
2017 115
2016 115
2015 115
2014 100
2013 100
2012 100
2011 100

You can reverse your column and then apply a function to update values. Finally reverse the column to the original order:
def get_val(x):
    global prev_num
    # NaN > anything is False, so NaN rows keep the previous value
    if x and x > prev_num * 1.1:
        prev_num = x
    return prev_num

prev_num = 0
df['number'] = df['number'][::-1].apply(get_val)[::-1]
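If you prefer to avoid the global, the same reverse scan can be wrapped in a closure (a sketch, same logic as above):

def make_get_val(start=0):
    prev = [start]  # mutable cell playing the role of prev_num
    def get_val(x):
        if x == x and x > prev[0] * 1.1:  # x == x is False for NaN
            prev[0] = x
        return prev[0]
    return get_val

df['number'] = df['number'][::-1].apply(make_get_val())[::-1]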

Just groupby the runs where the back-filled values, floor-divided by 10, stay the same (a new group starts wherever the difference is not equal to zero), then transform the min, i.e.:
key = (df['number'].bfill()[::-1] // 10).diff().ne(0).cumsum()
df['x'] = df.groupby(key)['number'].transform(min)
date number x
0 2019 150.0 150.0
1 2018 NaN 115.0
2 2017 118.0 115.0
3 2016 NaN 115.0
4 2015 115.0 115.0
5 2014 107.0 100.0
6 2013 105.0 100.0
7 2012 NaN 100.0
8 2011 100.0 100.0
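To see what the grouping key looks like, here is a sketch that rebuilds the sample frame (with the lowercase column names used in this answer) and prints the intermediate series. Note the //10 banding happens to coincide with the 10% rule for this data, but it is not the same condition in general.

import numpy as np
import pandas as pd

df = pd.DataFrame({'date': range(2019, 2010, -1),
                   'number': [150, np.nan, 118, np.nan, 115, 107, 105, np.nan, 100]})

# Backfill NaNs, reverse to oldest-first, floor-divide by 10 so values in
# the same band of ten collapse to one integer, then start a new group
# wherever that integer changes.
key = (df['number'].bfill()[::-1] // 10).diff().ne(0).cumsum()
print(key.sort_index())  # row 0 -> group 3, rows 1-4 -> group 2, rows 5-8 -> group 1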

Here is one way. It assumes the first value (100) is not NaN and that the original dataframe is ordered descending by year. If performance is an issue, the loop can be converted to a list comprehension.
lst = df.sort_values('date')['number'].ffill().tolist()
for i in range(1, len(lst)):
    if abs(lst[i] - lst[i-1]) / lst[i] <= 0.10:
        lst[i] = lst[i-1]
df['number'] = list(reversed(lst))
# date number
# 0 2019 150.0
# 1 2018 115.0
# 2 2017 115.0
# 3 2016 115.0
# 4 2015 115.0
# 5 2014 100.0
# 6 2013 100.0
# 7 2012 100.0
# 8 2011 100.0


Replace missing values based on value of a specific column in Python

I would like to replace missing values based on the values of the column Submitted.
Below is what I have:
Year  Country  Submitted  Age12  Age14
2018  CHI      1          267    NaN
2019  CHI      NaN        NaN    NaN
2020  CHI      1          244    203
2018  ALB      1          163    165
2019  ALB      1          NaN    NaN
2020  ALB      1          161    NaN
2018  GER      1          451    381
2019  GER      NaN        NaN    NaN
2020  GER      1          361    321
And this is what I would like to have:
Year  Country  Submitted  Age12  Age14
2018  CHI      1          267    NaN
2019  CHI      NaN        267    NaN
2020  CHI      1          244    203
2018  ALB      1          163    165
2019  ALB      1          NaN    NaN
2020  ALB      1          161    NaN
2018  GER      1          451    381
2019  GER      NaN        451    381
2020  GER      1          361    321
I tried using the command df.fillna(axis=0, method='ffill'), but this replaces all NaN values with the previous row's value, which is not what I want, because some values should be kept as NaN when the "Submitted" column value is 1.
I would like to fill the values from the previous row only if the respective "Submitted" value is NaN.
Thank you
Try using where together with what you did:
df = df.where(~df.Submitted.isnull(), df.fillna(axis=0, method='ffill'))
This will replace the entries only when Submitted is null.
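Note, though, that where takes the whole row from the forward-filled frame wherever the condition is False, so Submitted itself gets filled too (becoming 1.0 for the 2019 rows) instead of staying NaN as in the desired output. If Submitted should keep its NaN, restrict the fill to the Age columns as in the answers below.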
You can do a conditional ffill() using np.where
import numpy as np

(
    df.assign(Age12=np.where(df.Submitted.isna(), df.Age12.ffill(), df.Age12))
      .assign(Age14=np.where(df.Submitted.isna(), df.Age14.ffill(), df.Age14))
)
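Note that assign returns a new DataFrame rather than modifying df in place, so capture the result (e.g. df = df.assign(...)) if you want to keep the filled values.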
You can use .filter() to select the related columns and put the columns in the list cols. Then, use .mask() to change the values of the selected columns by forward fill using ffill() when Submitted is NaN, as follows:
cols = df.filter(like='Age').columns
df[cols] = df[cols].mask(df['Submitted'].isna(), df[cols].ffill())
Result:
print(df)
Year Country Submitted Age12 Age14
0 2018 CHI 1.0 267.0 NaN
1 2019 CHI NaN 267.0 NaN
2 2020 CHI 1.0 244.0 203.0
3 2018 ALB 1.0 163.0 165.0
4 2019 ALB 1.0 NaN NaN
5 2020 ALB 1.0 161.0 NaN
6 2018 GER 1.0 451.0 381.0
7 2019 GER NaN 451.0 381.0
8 2020 GER 1.0 361.0 321.0
I just used a for loop to check and update the values in the dataframe
import pandas as pd

new_data = [[2018, 'CHI', 1, 267, 30],
            [2019, 'CHI', 'NaN', 'NaN', 'NaN'],
            [2020, 'CHI', 1, 244, 203]]
df = pd.DataFrame(new_data, columns=['Year', 'Country', 'Submitted', 'Age12', 'Age14'])

prevValue12 = df.iloc[0]['Age12']
prevValue14 = df.iloc[0]['Age14']
for index, row in df.iterrows():
    if row['Submitted'] == 'NaN':
        df.at[index, 'Age12'] = prevValue12
        df.at[index, 'Age14'] = prevValue14
    prevValue12 = row['Age12']
    prevValue14 = row['Age14']
print(df)
output
Year Country Submitted Age12 Age14
0 2018 CHI 1 267 30
1 2019 CHI NaN 267 30
2 2020 CHI 1 244 203
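This works because the sample frame stores the literal string 'NaN' rather than a real missing value; with actual NaNs the test would need to be pd.isna(row['Submitted']) instead.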

python pandas add new column with values grouped count

I want to add a new column with the number of times, per year, that the points were over 700, counting only years after 2014.
import pandas as pd
import numpy as np

ipl_data = {'Year': [2014,2015,2014,2015,2014,2015,2016,2017,2016,2014,2015,2017],
            'Points': [876,789,863,673,741,812,756,788,694,701,804,690]}
df = pd.DataFrame(ipl_data)
grouped = df.groupby('Year')
df.loc[(df['Points'] > 700) & (df['Year'] > 2014), 'High_points'] = df['Points']
#df['Point_per_year_gr_700'] = df.groupby(by='Year')['Points'].transform('count')
df['Point_per_year_gr_700'] = grouped['Points'].agg(np.size)
The end dataframe should look like this, but I can't get the 'Point_per_year_gr_700' right:
    Year  Points  Point_per_year_gr_700  High_points
0   2014     876                    NaN          NaN
1   2015     789                      3        789.0
2   2014     863                    NaN          NaN
3   2015     673                    NaN          NaN
4   2014     741                    NaN          NaN
5   2015     812                      3        812.0
6   2016     756                      1        756.0
7   2017     788                      1        788.0
8   2016     694                    NaN          NaN
9   2014     701                    NaN          NaN
10  2015     804                      3        804.0
11  2017     690                    NaN          NaN
Use where to mask the DataFrame to NaN where your condition isn't met. You can use this both to create the High_points column and to exclude rows that shouldn't count when you group by year to find how many rows satisfy the condition each year.
df['High_points'] = df['Points'].where(df['Year'].gt(2014) & df['Points'].gt(700))
df['ppy_gt700'] = (df.where(df['High_points'].notnull())
                     .groupby('Year')['Year'].transform('size'))
    Year  Points  High_points  ppy_gt700
0 2014 876 NaN NaN
1 2015 789 789.0 3.0
2 2014 863 NaN NaN
3 2015 673 NaN NaN
4 2014 741 NaN NaN
5 2015 812 812.0 3.0
6 2016 756 756.0 1.0
7 2017 788 788.0 1.0
8 2016 694 NaN NaN
9 2014 701 NaN NaN
10 2015 804 804.0 3.0
11 2017 690 NaN NaN
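For reference, a variant that derives both columns from a single boolean mask (a sketch, using the df built in the question):

mask = df['Points'].gt(700) & df['Year'].gt(2014)
df['High_points'] = df['Points'].where(mask)
# Count the qualifying rows per year, then blank out the non-qualifying rows.
df['Point_per_year_gr_700'] = (df['High_points']
                               .groupby(df['Year']).transform('count')
                               .where(mask))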

Pandas drop nan in a specific row ('Feb-29') and shift remaining rows up

I have a pandas dataframe containing several years of timeseries data as columns. Each starts in November and ends in the subsequent year. I'm trying to deal with NaN's in non-leap years. The structure can be recreated with something like this:
import pandas as pd
import numpy as np
from datetime import datetime, timedelta
ndays = 151
sdates = [(datetime(2019,11,1) + timedelta(days=x)).strftime("%b-%d") for x in range(ndays)]
columns=list(range(2016,2021))
df = pd.DataFrame(np.random.randint(0,100,size=(ndays, len(columns))), index=sdates, columns=columns)
df.loc["Feb-29",2017:2019] = np.nan
df.loc["Feb-28":"Mar-01"]
Out[61]:
2016 2017 2018 2019 2020
Feb-28 36 59.0 74.0 19.0 24
Feb-29 85 NaN NaN NaN 6
Mar-01 24 75.0 49.0 99.0 82
What I want to do is remove the "Feb-29" NaN data only (in the non-leap years) and then shift the data in those columns up a row, leaving the leap years as-is. Something like this, with "Mar-01" and subsequent rows shifted up for 2017 through 2019:
2016 2017 2018 2019 2020
Feb-28 36 59.0 74.0 19.0 24
Feb-29 85 75.0 49.0 99.0 6
Mar-01 24 42.0 21.0 41.0 82
I don't care that "Mar-01" data will be labelled as "Feb-29" as eventually I'll be replacing the string date index with an integer index.
Note that I didn't include this in the example, but I have NaN's at the start and end of the dataframe in varying rows that I do not want to remove (i.e. I can't just remove all NaN data; I need to target "Feb-29" specifically).
It sounds like you don't actually want to shift dates up, but rather number them correctly based on the day of the year? If so, this will work:
First, make the DataFrame long instead of wide:
df = pd.DataFrame(
    {
        "2016": {"Feb-28": 36, "Feb-29": 85, "Mar-01": 24},
        "2017": {"Feb-28": 59.0, "Feb-29": None, "Mar-01": 75.0},
        "2018": {"Feb-28": 74.0, "Feb-29": None, "Mar-01": 49.0},
        "2019": {"Feb-28": 19.0, "Feb-29": None, "Mar-01": 99.0},
        "2020": {"Feb-28": 24, "Feb-29": 6, "Mar-01": 82},
    }
)
df = (
    df.melt(ignore_index=False, var_name="year", value_name="value")
    .reset_index()
    .rename(columns={"index": "month-day"})
)
df
month-day year value
0 Feb-28 2016 36.0
1 Feb-29 2016 85.0
2 Mar-01 2016 24.0
3 Feb-28 2017 59.0
4 Feb-29 2017 NaN
5 Mar-01 2017 75.0
6 Feb-28 2018 74.0
7 Feb-29 2018 NaN
8 Mar-01 2018 49.0
9 Feb-28 2019 19.0
10 Feb-29 2019 NaN
11 Mar-01 2019 99.0
12 Feb-28 2020 24.0
13 Feb-29 2020 6.0
14 Mar-01 2020 82.0
Then remove rows containing an invalid date and get the day of the year for remaining days:
df["date"] = pd.to_datetime(
df.apply(lambda row: " ".join(row[["year", "month-day"]]), axis=1), errors="coerce",
)
df = df[df["date"].notna()]
df["day_of_year"] = df["date"].dt.dayofyear
df
month-day year value date day_of_year
0 Feb-28 2016 36.0 2016-02-28 59
1 Feb-29 2016 85.0 2016-02-29 60
2 Mar-01 2016 24.0 2016-03-01 61
3 Feb-28 2017 59.0 2017-02-28 59
5 Mar-01 2017 75.0 2017-03-01 60
6 Feb-28 2018 74.0 2018-02-28 59
8 Mar-01 2018 49.0 2018-03-01 60
9 Feb-28 2019 19.0 2019-02-28 59
11 Mar-01 2019 99.0 2019-03-01 60
12 Feb-28 2020 24.0 2020-02-28 59
13 Feb-29 2020 6.0 2020-02-29 60
14 Mar-01 2020 82.0 2020-03-01 61
I think this would do the trick?
is_leap = ~np.isnan(df.loc["Feb-29"])
leap = df.loc[:, is_leap]
nonleap = df.loc[:, ~is_leap]
nonleap = nonleap.drop(index="Feb-29")
df2 = pd.concat([
    leap.reset_index(drop=True),
    nonleap.reset_index(drop=True)
], axis=1).sort_index(axis=1)
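Note that both halves are reset to a positional index before the concat, so the "Feb-28"-style labels are discarded; that matches the plan in the question to replace the string date index with an integer index anyway.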
This should move all the nulls to the end by sorting on pd.isnull, which I believe is what you want:
df = df.apply(lambda x: sorted(x, key=pd.isnull),axis=0)
Before:
2016 2017 2018 2019 2020
Feb-28 35 85.0 88.0 46.0 19
Feb-29 41 NaN NaN NaN 52
Mar-01 86 92.0 29.0 44.0 36
After:
2016 2017 2018 2019 2020
Feb-28 35 85.0 88.0 46.0 19
Feb-29 41 92.0 29.0 44.0 52
Mar-01 86 32.0 50.0 27.0 36
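This works because Python's sort is stable and the key is only the boolean pd.isnull: non-null values keep their original relative order within each column, and the NaNs sink to the bottom.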

Adding column in pandas based on values from other columns with conditions

I have a dataframe with information about sales of some products (unit):
unit year month price
0 1 2018 6 100
1 1 2013 4 70
2 2 2015 10 80
3 2 2015 2 110
4 3 2017 4 120
5 3 2002 6 90
6 4 2016 1 55
and I would like to add, for each sale, columns with information about the previous sale of the same unit, with NaN if there is no previous sale:
unit year month price prev_price prev_year prev_month
0 1 2018 6 100 70.0 2013.0 4.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 110.0 2015.0 2.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 90.0 2002.0 6.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
At the moment I group on the unit, keep the units that have several rows, and then extract, for those units, the information associated with the earliest date. Then I join this table with my original table, keeping only the rows whose date differs between the two merged tables.
I feel like there is a much simpler way to do this, but I am not sure how.
Use DataFrameGroupBy.shift with add_prefix and join to append new DataFrame to original:
#if real data are not sorted
#df = df.sort_values(['unit','year','month'], ascending=[True, False, False])
df = df.join(df.groupby('unit', sort=False).shift(-1).add_prefix('prev_'))
print (df)
unit year month price prev_year prev_month prev_price
0 1 2018 6 100 2013.0 4.0 70.0
1 1 2013 4 70 NaN NaN NaN
2 2 2015 10 80 2015.0 2.0 110.0
3 2 2015 2 110 NaN NaN NaN
4 3 2017 4 120 2002.0 6.0 90.0
5 3 2002 6 90 NaN NaN NaN
6 4 2016 1 55 NaN NaN NaN
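The shift(-1) works because rows within each unit are ordered newest first, so each row's "next" row is the chronologically previous sale; for data sorted ascending you would use shift(1) instead.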

pandas DataFrame .stack(dropna=False) but keeping existing combinations of levels

My data looks like this
import numpy as np
import pandas as pd
# My Data
enroll_year = np.arange(2010, 2015)
grad_year = enroll_year + 4
n_students = [[100, 100, 110, 110, np.nan]]
df = pd.DataFrame(
    n_students,
    columns=pd.MultiIndex.from_arrays(
        [enroll_year, grad_year],
        names=['enroll_year', 'grad_year']))
print(df)
# enroll_year 2010 2011 2012 2013 2014
# grad_year 2014 2015 2016 2017 2018
# 0 100 100 110 110 NaN
What I am trying to do is stack the data, with one index level for the year of enrollment, one for the year of graduation, and one column for the number of students, which should look like:
# enroll_year grad_year n
# 2010 2014 100.0
# . . .
# . . .
# . . .
# 2014 2018 NaN
The data produced by .stack() is very close, but the missing record(s) are dropped:
df1 = df.stack(['enroll_year', 'grad_year'])
df1.index = df1.index.droplevel(0)
print(df1)
# enroll_year grad_year
# 2010 2014 100.0
# 2011 2015 100.0
# 2012 2016 110.0
# 2013 2017 110.0
# dtype: float64
So .stack(dropna=False) was tried, but it expands the index levels to all combinations of enrollment and graduation years:
df2 = df.stack(['enroll_year', 'grad_year'], dropna=False)
df2.index = df2.index.droplevel(0)
print(df2)
# enroll_year grad_year
# 2010 2014 100.0
# 2015 NaN
# 2016 NaN
# 2017 NaN
# 2018 NaN
# 2011 2014 NaN
# 2015 100.0
# 2016 NaN
# 2017 NaN
# 2018 NaN
# 2012 2014 NaN
# 2015 NaN
# 2016 110.0
# 2017 NaN
# 2018 NaN
# 2013 2014 NaN
# 2015 NaN
# 2016 NaN
# 2017 110.0
# 2018 NaN
# 2014 2014 NaN
# 2015 NaN
# 2016 NaN
# 2017 NaN
# 2018 NaN
# dtype: float64
And I need to subset df2 to get my desired data set.
existing_combn = list(zip(
    df.columns.levels[0][df.columns.labels[0]],
    df.columns.levels[1][df.columns.labels[1]]))
df3 = df2.loc[existing_combn]
print(df3)
# enroll_year grad_year
# 2010 2014 100.0
# 2011 2015 100.0
# 2012 2016 110.0
# 2013 2017 110.0
# 2014 2018 NaN
# dtype: float64
Although this only adds a few extra lines to my code, I wonder if there is a better and neater approach.
Use unstack to get a Series, wrap it in pd.DataFrame, then reset_index, drop the unnecessary level column, and rename the value column:
pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1).rename(columns={0:'n'})
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
Or:
df.unstack().reset_index(level=2, drop=True)
enroll_year grad_year
2010 2014 100.0
2011 2015 100.0
2012 2016 110.0
2013 2017 110.0
2014 2018 NaN
dtype: float64
Or:
df.unstack().reset_index(level=2, drop=True).reset_index().rename(columns={0:'n'})
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
Explanation:
print(pd.DataFrame(df.unstack()))
0
enroll_year grad_year
2010 2014 0 100.0
2011 2015 0 100.0
2012 2016 0 110.0
2013 2017 0 110.0
2014 2018 0 NaN
print(pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1))
enroll_year grad_year 0
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
print(pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1).rename(columns={0:'n'}))
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
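As a side note, the MultiIndex.labels attribute used in the question was deprecated in favor of MultiIndex.codes in pandas 0.24, and the zip is not needed at all: df.columns already holds exactly the existing (enroll_year, grad_year) pairs, so the subset can be taken directly (a sketch):

# df.columns is the MultiIndex of existing pairs; use it to subset df2 directly.
df3 = df2.loc[list(df.columns)]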
