How to reindex on month and year columns to insert missing data? - python

Consider following dataframe:
df = pd.read_csv("data.csv")
print(df)
  Category  Year     Month  Count1  Count2
0        a  2017  December       5       9
1        a  2018   January       3       5
2        b  2017   October       7       6
3        b  2017  November       4       1
4        b  2018     March       3       3
I want to achieve this:
   Category  Year     Month  Count1  Count2
0         a  2017   October
1         a  2017  November
2         a  2017  December       5       9
3         a  2018   January       3       5
4         a  2018  February
5         a  2018     March
6         b  2017   October       7       6
7         b  2017  November       4       1
8         b  2017  December
9         b  2018   January
10        b  2018  February
11        b  2018     March       3       3
Here is what I've done so far:
months = {"January": 1, "February": 2, "March": 3, "April": 4, "May": 5, "June": 6, "July": 7, "August": 8, "September": 9, "October": 10, "November": 11, "December": 12}
df["Date"] = pd.to_datetime(10000 * df["Year"] + 100 * df["Month"].apply(months.get) + 1, format="%Y%m%d")
date_min = df["Date"].min()
date_max = df["Date"].max()
new_index = pd.MultiIndex.from_product([df["Category"].unique(), pd.date_range(date_min, date_max, freq="M")], names=["Category", "Date"])
df = df.set_index(["Category", "Date"]).reindex(new_index).reset_index()
df["Year"] = df["Date"].dt.year
df["Month"] = df["Date"].dt.month_name()
df = df[["Category", "Year", "Month", "Count1", "Count2"]]
In the resulting dataframe, the last month (March) is missing and all "Count1" and "Count2" values are NaN.
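The all-NaN result in the attempt above has a one-character cause: freq="M" generates month-end dates, while the constructed Date column holds first-of-month dates, so reindex finds no matching labels. A minimal sketch of the same approach with freq="MS" (month start), using the sample data above:

```python
import pandas as pd

df = pd.DataFrame({
    "Category": ["a", "a", "b", "b", "b"],
    "Year": [2017, 2018, 2017, 2017, 2018],
    "Month": ["December", "January", "October", "November", "March"],
    "Count1": [5, 3, 7, 4, 3],
    "Count2": [9, 5, 6, 1, 3],
})
months = {m: i for i, m in enumerate(
    ["January", "February", "March", "April", "May", "June", "July",
     "August", "September", "October", "November", "December"], start=1)}
df["Date"] = pd.to_datetime(
    10000 * df["Year"] + 100 * df["Month"].map(months) + 1, format="%Y%m%d")
# "MS" produces first-of-month dates that match the Date column;
# "M" would produce month-end dates and every reindexed row would be NaN.
new_index = pd.MultiIndex.from_product(
    [df["Category"].unique(),
     pd.date_range(df["Date"].min(), df["Date"].max(), freq="MS")],
    names=["Category", "Date"])
out = df.set_index(["Category", "Date"]).reindex(new_index).reset_index()
out["Year"] = out["Date"].dt.year
out["Month"] = out["Date"].dt.month_name()
out = out[["Category", "Year", "Month", "Count1", "Count2"]]
print(len(out))  # 12: 2 categories x 6 months, October 2017 through March 2018
```

This reproduces the desired 12-row layout; the remaining NaN counts can then be filled with fillna(0) if wanted.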

This is complicated by the fact that you want to fill the category as well as the missing dates. One solution is to create a separate data frame for each category and then concatenate them all together.
df['Date'] = pd.to_datetime('1 '+df.Month.astype(str)+' '+df.Year.astype(str))
df_ix = pd.Series(1, index=df.Date.sort_values()).resample('MS').first().reset_index()
df_list = []
for cat in df.Category.unique():
    df_temp = (df.query('Category == @cat')
                 .merge(df_ix, on='Date', how='right')
                 .get(['Date', 'Category', 'Count1', 'Count2'])
                 .sort_values('Date')
               )
    df_temp.Category = cat
    df_temp = df_temp.fillna(0)
    df_temp.loc[:, ['Count1', 'Count2']] = df_temp.get(['Count1', 'Count2']).astype(int)
    df_list.append(df_temp)
df2 = pd.concat(df_list, ignore_index=True)
df2['Month'] = df2.Date.apply(lambda x: x.strftime('%B'))
df2['Year'] = df2.Date.apply(lambda x: x.year)
df2.drop('Date', axis=1)
# returns:
   Category  Count1  Count2     Month  Year
0         a       0       0   October  2017
1         a       0       0  November  2017
2         a       5       9  December  2017
3         a       3       5   January  2018
4         a       0       0  February  2018
5         a       0       0     March  2018
6         b       7       6   October  2017
7         b       4       1  November  2017
8         b       0       0  December  2017
9         b       0       0   January  2018
10        b       0       0  February  2018
11        b       3       3     March  2018


Add column with month and year to a new date in pandas

I have this dataset:
df = pd.DataFrame()
df['year'] = [2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011,2011]
df['month'] = [1,2,3,4,5,6,1,2,3,4,5,6]
df['after'] = [0,0,0,1,1,1,0,0,0,1,1,1]
df['campaign'] = [0,0,0,0,0,0,1,1,1,1,1,1]
df['sales'] = [10000,11000,12000,10500,10000,9500,7000,8000,5000,6000,6000,7000]
And I want a new column, date, that combines year and month into a year-month date. I tried:
df['my_month'] = df['year']*100 + df['month'] + 1
But I'm stuck on what to do next. Any help will be greatly appreciated.
If we need the start date of the month, then:
df['date'] = pd.to_datetime(df.year.astype(str) + '-' + df.month.astype(str))
Sample Output
   year  month  after  campaign  sales       date
0  2011      1      0         0  10000 2011-01-01
1  2011      2      0         0  11000 2011-02-01
2  2011      3      0         0  12000 2011-03-01
3  2011      4      1         0  10500 2011-04-01
Edit, as per the comment:
When a year-month (period) format is required:
pd.to_datetime(df.year.astype(str) + '-' + df.month.astype(str)).dt.to_period('M')
Sample output
   year  month  after  campaign  sales     date
0  2011      1      0         0  10000  2011-01
1  2011      2      0         0  11000  2011-02
2  2011      3      0         0  12000  2011-03
3  2011      4      1         0  10500  2011-04
import pandas as pd
from datetime import date

def get_date(year, month):
    return date(year, month, 1)

def create_dataframe():
    df = pd.DataFrame()
    df['year'] = [2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011]
    df['month'] = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
    df['after'] = [0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1]
    df['campaign'] = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
    df['sales'] = [10000, 11000, 12000, 10500, 10000, 9500, 7000, 8000, 5000, 6000, 6000, 7000]
    df['my_month'] = df.apply(lambda x: get_date(x.year, x.month), axis=1)
    print(df.to_string())

if __name__ == '__main__':
    create_dataframe()
Output:
    year  month  after  campaign  sales    my_month
0   2011      1      0         0  10000  2011-01-01
1   2011      2      0         0  11000  2011-02-01
2   2011      3      0         0  12000  2011-03-01
3   2011      4      1         0  10500  2011-04-01
4   2011      5      1         0  10000  2011-05-01
5   2011      6      1         0   9500  2011-06-01
6   2011      1      0         1   7000  2011-01-01
7   2011      2      0         1   8000  2011-02-01
8   2011      3      0         1   5000  2011-03-01
9   2011      4      1         1   6000  2011-04-01
10  2011      5      1         1   6000  2011-05-01
11  2011      6      1         1   7000  2011-06-01
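As a side note, the row-wise apply above can be replaced by a vectorized call: pd.to_datetime assembles a datetime column directly from year/month/day components passed as a mapping. A sketch on a trimmed version of the same data:

```python
import pandas as pd

df = pd.DataFrame({"year": [2011, 2011, 2011], "month": [1, 2, 3]})
# to_datetime accepts a mapping with "year", "month" and "day" keys
df["my_month"] = pd.to_datetime(dict(year=df["year"], month=df["month"], day=1))
print(df["my_month"].dt.strftime("%Y-%m-%d").tolist())
# ['2011-01-01', '2011-02-01', '2011-03-01']
```

This avoids a Python-level loop over rows, which matters on large frames.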

Add two dataframe rows?

I have a df with a year column, and I'm trying to combine each row's year with the next row's year.
df
year
0 2020
1 2019
2 2018
3 2017
4 2016
Final df
year combine
0 2020 2020-2019
1 2019 2019-2018
2 2018 2018-2017
3 2017 2017-2016
4 2016 NaN
Let's use shift:
df['combine'] = df.year.astype(str) + '-' + df.year.astype(str).shift(-1)
df
Out[302]:
year combine
0 2020 2020-2019
1 2019 2019-2018
2 2018 2018-2017
3 2017 2017-2016
4 2016 NaN
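A self-contained check of the shift idea (the last row has no successor, hence the NaN):

```python
import pandas as pd

df = pd.DataFrame({'year': [2020, 2019, 2018, 2017, 2016]})
s = df['year'].astype(str)
# shift(-1) moves each value one row up, leaving NaN in the final row
df['combine'] = s + '-' + s.shift(-1)
print(df['combine'].tolist())
# ['2020-2019', '2019-2018', '2018-2017', '2017-2016', nan]
```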

Count Number of Rows within Time Interval in Pandas Dataframe

Say we have this data:
list1, list2, list3 = [1,2,3,4], [1990, 1990, 1990, 1991], [2009, 2009, 2009, 2009]
df = pd.DataFrame(list(zip(list1, list2, list3)), columns = ['Index', 'Y0', 'Y1'])
> df
   Index    Y0    Y1
0      1  1990  2009
1      2  1990  2009
2      3  1990  2009
3      4  1991  2009
I want to count, for each year, how many rows ("Index") fall within that year, excluding the Y0 year itself, i.e. the interval (Y0, Y1].
So say we start at the first available year, 1990:
How many rows do we count? 0.
1991:
Three (row 1, 2, 3)
1992:
Four (row 1, 2, 3, 4)
...
2009:
Four (row 1, 2, 3, 4)
So I want to end up with a dataframe that says:
Count  Year
    0  1990
    3  1991
    4  1992
  ...   ...
    4  2009
My attempt:
df['Y0'] = pd.to_datetime(df['Y0'], format='%Y')
df['Y1'] = pd.to_datetime(df['Y1'], format='%Y')
# Group by the interval between Y0 and Y1
df = df.groupby([df['Y0'].dt.year, df['Y1'].dt.year]).agg({'count'})
df.columns = ['count', 'Y0 count', 'Y1 count']
# sum the total
df_sum = pd.DataFrame(df.groupby(df.index)['count'].sum())
But the result doesn't look right.
Appreciate any help.
You could do:
import numpy as np

min_year = df[['Y0', 'Y1']].values.min()
max_year = df[['Y0', 'Y1']].values.max()
year_range = np.arange(min_year, max_year + 1)
counts = ((df[['Y0']].values < year_range) & (year_range <= df[['Y1']].values)).sum(axis=0)
o = pd.DataFrame({'counts': counts, 'year': year_range})
    counts  year
0        0  1990
1        3  1991
2        4  1992
3        4  1993
4        4  1994
5        4  1995
6        4  1996
7        4  1997
8        4  1998
9        4  1999
10       4  2000
11       4  2001
12       4  2002
13       4  2003
14       4  2004
15       4  2005
16       4  2006
17       4  2007
18       4  2008
19       4  2009
The following should do your job:
counts = []
years = []

def count_in_interval(year):
    n = 0
    for i in range(len(df)):
        if df['Y0'][i] < year <= df['Y1'][i]:
            n += 1
    return n

for i in range(1990, 2010):
    counts.append(count_in_interval(i))
    years.append(i)

result = pd.DataFrame(zip(counts, years), columns=['Count', 'Year'])
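Both answers implement the same condition, Y0 < year <= Y1; a compact middle ground, assuming only pandas, keeps the loop over candidate years but vectorizes the per-row test:

```python
import pandas as pd

df = pd.DataFrame({"Index": [1, 2, 3, 4],
                   "Y0": [1990, 1990, 1990, 1991],
                   "Y1": [2009, 2009, 2009, 2009]})
years = range(df["Y0"].min(), df["Y1"].max() + 1)
# For each candidate year, count rows whose interval (Y0, Y1] contains it
result = pd.DataFrame({
    "Count": [((df["Y0"] < y) & (y <= df["Y1"])).sum() for y in years],
    "Year": list(years),
})
print(result.head(3))
```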

Pandas dataframe multiple groupby filtering

I have the following dataframe:
df2 = pd.DataFrame({'season':[1,1,1,2,2,2,3,3],'value' : [-2, 3,1,5,8,6,7,5], 'test':[3,2,6,8,7,4,25,2],'test2':[4,5,7,8,9,10,11,12]},index=['2020', '2020', '2020','2020', '2020', '2021', '2021', '2021'])
df2.index= pd.to_datetime(df2.index)
df2.index = df2.index.year
print(df2)
      season  test  test2  value
2020       1     3      4     -2
2020       1     2      5      3
2020       1     6      7      1
2020       2     8      8      5
2020       2     7      9      8
2021       2     4     10      6
2021       3    25     11      7
2021       3     2     12      5
I would like to filter it to obtain for each year and each season of that year the maximum value of the column 'value'. How can I do that efficiently?
Expected result:
print(df_result)
      season  value  test  test2
year
2020       1      3     2      5
2020       2      8     7      9
2021       2      6     4     10
2021       3      7    25     11
Thank you for your help,
Pierre
This is a groupby operation, but a little non-trivial, so posting as an answer.
(df2.set_index('season', append=True)
    .groupby(level=[0, 1])
    .value.max()
    .reset_index(level=1)
)
      season  value
2020       1      3
2020       2      8
2021       2      6
2021       3      7
You can elevate your index to a series, then perform a groupby operation on a list of columns:
df2['year'] = df2.index
df_result = df2.groupby(['year', 'season'])['value'].max().reset_index()
print(df_result)
   year  season  value
0  2020       1      3
1  2020       2      8
2  2021       2      6
3  2021       3      7
If you wish, you can make year your index again via df_result = df_result.set_index('year').
To keep other columns use:
df2['year'] = df2.index
df2['value'] = df2.groupby(['year', 'season'])['value'].transform('max')
Then drop any duplicates via pd.DataFrame.drop_duplicates.
Update #1
For your new requirement, you need to apply an aggregation function for 2 series:
df2['year'] = df2.index
df_result = df2.groupby(['year', 'season'])\
.agg({'value': 'max', 'test': 'last'})\
.reset_index()
print(df_result)
   year  season  value  test
0  2020       1      3     6
1  2020       2      8     7
2  2021       2      6     4
3  2021       3      7     2
Update #2
For your finalised requirement:
df2['year'] = df2.index
df2['max_value'] = df2.groupby(['year', 'season'])['value'].transform('max')
df_result = df2.loc[df2['value'] == df2['max_value']]\
               .drop_duplicates(['year', 'season'])\
               .drop('max_value', axis=1)
print(df_result)
      season  value  test  test2  year
2020       1      3     2      5  2020
2020       2      8     7      9  2020
2021       2      6     4     10  2021
2021       3      7    25     11  2021
You can use get_level_values to bring the index values into the groupby:
df2.groupby([df2.index.get_level_values(0), df2.season]).value.max().reset_index(level=1)
Out[38]:
      season  value
2020       1      3
2020       2      8
2021       2      6
2021       3      7
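If the goal is the whole row of each group maximum (as in the expected output, which keeps test and test2 from the maximal row), a common idiom is groupby().idxmax() on a unique index. A sketch, rebuilding the sample frame:

```python
import pandas as pd

df2 = pd.DataFrame({'season': [1, 1, 1, 2, 2, 2, 3, 3],
                    'value': [-2, 3, 1, 5, 8, 6, 7, 5],
                    'test': [3, 2, 6, 8, 7, 4, 25, 2],
                    'test2': [4, 5, 7, 8, 9, 10, 11, 12]},
                   index=[2020, 2020, 2020, 2020, 2020, 2021, 2021, 2021])
df2.index.name = 'year'
# reset_index yields a unique RangeIndex, so each idxmax label picks a single row
tmp = df2.reset_index()
df_result = tmp.loc[tmp.groupby(['year', 'season'])['value'].idxmax()]
print(df_result[['year', 'season', 'value', 'test', 'test2']])
```

Unlike transform('max') plus drop_duplicates, this never overwrites the original value column.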

A more complex rolling sum over the next n rows

I have the following dataframe:
print(df)
  day month  year  quantity
0   6    04  2018        10
1   8    04  2018         8
2  12    04  2018         8
I would like to create a column, sum of the "quantity" over the next "n" days, as it follows:
n = 2
print(df1)
  day month  year  quantity  final_quantity
0   6    04  2018        10  10 + 0 + 8 = 18
1   8    04  2018         8   8 + 0 + 0 = 8
2  12    04  2018         8   8 + 0 + 0 = 8
Specifically, summing 0 if the product has not been sold in the next "n" days.
I tried a rolling sum from pandas, but it does not seem to work with the date split across different columns:
n = 2
df.quantity[::-1].rolling(n + 1, min_periods=1).sum()[::-1]
You can use a list comprehension:
import pandas as pd

df['DateTime'] = pd.to_datetime(df[['year', 'month', 'day']])
df['final_quantity'] = [df.loc[df['DateTime'].between(d, d + pd.Timedelta(days=2)), 'quantity'].sum()
                        for d in df['DateTime']]
print(df)
#    day  month  year  quantity   DateTime  final_quantity
# 0    6      4  2018        10 2018-04-06              18
# 1    8      4  2018         8 2018-04-08               8
# 2   12      4  2018         8 2018-04-12               8
You can use set_index and rolling with sum:
df_out = df.set_index(pd.to_datetime(df['month'].astype(str) +
                                     df['day'].astype(str) +
                                     df['year'].astype(str), format='%m%d%Y'))['quantity']
d1 = df_out.resample('D').asfreq(fill_value=0)
d2 = d1[::-1].reset_index()
df['final_quantity'] = (d2['quantity'].rolling(3, min_periods=1).sum()[::-1]
                        .to_frame()
                        .set_index(d1.index)
                        .reindex(df_out.index).values)
Output:
   day  month  year  quantity  final_quantity
0    6      4  2018        10            18.0
1    8      4  2018         8             8.0
2   12      4  2018         8             8.0
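On pandas 1.1+, the same forward-looking sum can be expressed without the double reversal by using FixedForwardWindowIndexer on the daily-resampled series. A sketch with the sample data, where n is the look-ahead in days:

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

df = pd.DataFrame({"day": [6, 8, 12], "month": [4, 4, 4],
                   "year": [2018, 2018, 2018], "quantity": [10, 8, 8]})
dates = pd.to_datetime(df[["year", "month", "day"]])
n = 2  # look-ahead window in days
# Build a daily series with 0 on unsold days, then sum each date plus the next n days
daily = pd.Series(df["quantity"].values, index=dates).resample("D").sum()
indexer = FixedForwardWindowIndexer(window_size=n + 1)
forward = daily.rolling(indexer, min_periods=1).sum()
df["final_quantity"] = forward.reindex(dates).values
print(df["final_quantity"].tolist())  # [18.0, 8.0, 8.0]
```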
