Pandas: fill out missing months in dataframe

My dataframe contains zipcodes, months and the number of purchases up until that month.
However, some months are missing for some zipcodes. As you can see in the example below, the months March and April are not recorded for zipcode '2400'.
Zipcode Date Cumulative purchases
0 9999 December 2018 2
1 9999 January 2019 2
2 9999 February 2019 2
3 9999 March 2019 3
4 9999 April 2019 4
5 2400 December 2018 2
6 2400 January 2019 3
7 2400 February 2019 4
etc
I would like to add the missing month records, carrying forward the last recorded cumulative purchases.
Ideally it would look like this:
Zipcode Date Cumulative purchases
0 9999 December 2018 2
1 9999 January 2019 2
2 9999 February 2019 2
3 9999 March 2019 3
4 9999 April 2019 4
5 2400 December 2018 2
6 2400 January 2019 3
7 2400 February 2019 4
8 2400 March 2019 4
9 2400 April 2019 4
etc

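You could use the complete function from pyjanitor to expose the missing rows, then forward-fill them:
# pip install pyjanitor
import pandas as pd
import janitor as jn  # importing janitor registers DataFrame.complete
# build every Zipcode/Date combination, then fill each new row from the row above it
df.complete('Zipcode', 'Date').ffill()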
Zipcode Date Cumulative purchases
0 9999 December 2018 2.0
1 9999 January 2019 2.0
2 9999 February 2019 2.0
3 9999 March 2019 3.0
4 9999 April 2019 4.0
5 2400 December 2018 2.0
6 2400 January 2019 3.0
7 2400 February 2019 4.0
8 2400 March 2019 4.0
9 2400 April 2019 4.0

Here is a slightly changed version of the previous answer: reset_index is removed, the result is reshaped with Series.unstack, the missing months up to until are added with DataFrame.reindex, missing values are forward-filled, and the frame is reshaped back with DataFrame.stack. Note that the code works from a Purchase column of per-record counts and computes the cumulative sum itself:
import pandas as pd

df['Date'] = pd.to_datetime(df['Date'])
# monthly totals per zipcode, then a running total, one row per zipcode
df = (df.set_index('Date')
        .groupby('Zipcode', sort=False)
        .resample('MS')['Purchase'].sum()
        .groupby(level=0)
        .cumsum()
        .unstack()
     )
# last month that every zipcode should be extended to
until = pd.to_datetime('2019-04')
df = (df.reindex(pd.date_range(df.columns.min(), until, freq='MS', name='Date'), axis=1)
        .ffill(axis=1)
        .stack()
        .astype(int)
        .reset_index(name='Cumulative purchases'))
df['Date'] = df['Date'].dt.strftime('%B %Y')
print (df)
Zipcode Date Cumulative purchases
0 9999 December 2018 2
1 9999 January 2019 2
2 9999 February 2019 2
3 9999 March 2019 3
4 9999 April 2019 4
5 2400 December 2018 2
6 2400 January 2019 3
7 2400 February 2019 4
8 2400 March 2019 4
9 2400 April 2019 4
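
If the frame already holds the cumulative totals, as in the question, one plain-pandas alternative is to reindex onto every (Zipcode, month) pair and forward-fill within each zipcode. This is only a sketch under assumptions: the sample data is copied from the question, zipcodes are assumed to be strings, and the names months, full_index and filled are made up for the illustration.

```python
import pandas as pd

# Sample data from the question: zipcode 2400 lacks March and April 2019.
df = pd.DataFrame({
    "Zipcode": ["9999"] * 5 + ["2400"] * 3,
    "Date": ["December 2018", "January 2019", "February 2019", "March 2019",
             "April 2019", "December 2018", "January 2019", "February 2019"],
    "Cumulative purchases": [2, 2, 2, 3, 4, 2, 3, 4],
})

# Parse "December 2018" style strings into month-start timestamps.
df["Date"] = pd.to_datetime(df["Date"], format="%B %Y")

# Every month between the earliest and latest date in the data.
months = pd.date_range(df["Date"].min(), df["Date"].max(), freq="MS")

# All (Zipcode, Date) pairs, keeping zipcodes in order of first appearance.
full_index = pd.MultiIndex.from_product(
    [df["Zipcode"].unique(), months], names=["Zipcode", "Date"]
)

# Rows that did not exist before appear with NaN in the value column.
filled = df.set_index(["Zipcode", "Date"]).reindex(full_index)

# Forward-fill inside each zipcode so values never leak across zipcodes.
filled["Cumulative purchases"] = (
    filled.groupby(level="Zipcode")["Cumulative purchases"].ffill()
)
filled = filled.reset_index()
filled["Cumulative purchases"] = filled["Cumulative purchases"].astype(int)
filled["Date"] = filled["Date"].dt.strftime("%B %Y")
print(filled)
```

Filling per group matters when a zipcode's first month is missing: a frame-wide ffill would copy the previous zipcode's last value into it, whereas the grouped version leaves it empty (in which case the final astype(int) would need to be dropped or replaced).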

Related

How to convert calendar year to financial year

I am trying to convert calendar years to financial years. I have a dataframe like the one below. Each ID has multiple records, and some months may be missing; for example, month 3 is missing where the 3rd row jumps straight to month 4.
df:
ID Callender_Date
1 01-01-2022
1 01-02-2022
1 01-04-2022
1 01-05-2022
1 01-05-2022
2 01-01-2022
2 01-07-2023
The financial year runs from July to June, e.g. FY 2022 means:
July 2021 - 1st month of the financial year
August 2021 - 2nd month of the financial year
September 2021 - 3rd month of the financial year
October 2021 - 4th month of the financial year
November 2021 - 5th month of the financial year
December 2021 - 6th month of the financial year
January 2022 - 7th month of the financial year
February 2022 - 8th month of the financial year
March 2022 - 9th month of the financial year
April 2022 - 10th month of the financial year
May 2022 - 11th month of the financial year
June 2022 - 12th month of the financial year
Expected output, converting calendar year to financial year:
ID Callender_Date Financial_Year Fiscal_Month
1 01-01-2022 2022 7
1 01-02-2022 2022 8
1 01-04-2022 2022 10
1 01-05-2022 2022 11
1 01-06-2022 2022 12
2 01-01-2021 2021 7
2 01-07-2021 2022 1
I tried the code below, which I found in another question:
df['Callender_Date '] = df['Callender_Date '].asfreq('J-July') - 1

Try:
# convert the column to datetime (if not already):
df['Callender_Date'] = pd.to_datetime(df['Callender_Date'], dayfirst=True)
# 'Q-JUN' means the fiscal year ends in June, so July 2021 falls in FY 2022
df['Financial_Year'] = df['Callender_Date'].dt.to_period('Q-JUN').dt.qyear
# shifting by six months maps July -> 1, ..., June -> 12
df['Fiscal_Month'] = (df['Callender_Date'] + pd.DateOffset(months=6)).dt.month
print(df)
Prints:
ID Callender_Date Financial_Year Fiscal_Month
0 1 2022-01-01 2022 7
1 1 2022-02-01 2022 8
2 1 2022-04-01 2022 10
3 1 2022-05-01 2022 11
4 1 2022-06-01 2022 12
5 2 2021-01-01 2021 7
6 2 2021-07-01 2022 1
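
As a cross-check that avoids period frequencies altogether, the financial year and fiscal month can also be derived with plain arithmetic on the calendar month. This is a minimal sketch, assuming a July-to-June financial year as described in the question; the constant FY_START_MONTH and the shortened sample frame are made up for the illustration.

```python
import pandas as pd

# A few dates taken from the question's expected output.
df = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "Callender_Date": ["01-01-2022", "01-04-2022", "01-06-2022",
                       "01-01-2021", "01-07-2021"],
})
df["Callender_Date"] = pd.to_datetime(df["Callender_Date"], format="%d-%m-%Y")

FY_START_MONTH = 7  # assumption: the financial year runs July to June

month = df["Callender_Date"].dt.month
year = df["Callender_Date"].dt.year

# Dates from July onward belong to the financial year named after the next calendar year.
df["Financial_Year"] = year + (month >= FY_START_MONTH).astype(int)

# July -> 1, August -> 2, ..., June -> 12.
df["Fiscal_Month"] = (month - FY_START_MONTH) % 12 + 1
print(df)
```

With these rules, 01-07-2021 lands in financial year 2022 as fiscal month 1, which matches the expected output in the question.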

How do I create a new column that references other row's data for its values?

I have the following data frame:
      Month  Day  Year     Open      High      Low    Close  Week
0         1    1  2003   46.593    46.656   46.405   46.468     1
1         1    2  2003   46.538     46.66    46.47   46.673     1
2         1    3  2003   46.717    46.781    46.53   46.750     1
3         1    4  2003   46.815    46.843    46.68   46.750     1
4         1    5  2003   46.935    47.000    46.56   46.593     1
...     ...  ...   ...      ...       ...      ...      ...   ...
7257     10   26  2022  381.619  387.5799  381.350  382.019    43
7258     10   27  2022   383.07    385.00  379.329   379.98    43
7259     10   28  2022  379.869   389.519   379.67  389.019    43
7260     10   31  2022   386.44   388.399   385.26  386.209    44
7261     11    1  2022   390.14    390.39   383.29  384.519    44
I want to create a new column titled 'week high' that holds, for each week of each year, the highest High of that week. So for Week 1, Year 2003, it will take the highest High from rows 0 to 4, but for Week 43, Year 2022, it will take the highest High from rows 7257 to 7259.
Is it possible to reference the columns Week and Year to calculate that value? Thanks!

Assuming pandas, create a weekly period and use it as the grouper for transform('max'):
group = pd.to_datetime(df[['Year', 'Month', 'Day']]).dt.to_period('W')
# or, if you already have a "Week" column, group by year and week together
# so that the same week number in different years is kept apart
# group = ['Year', 'Week']
df['week_high'] = df.groupby(group)['High'].transform('max')
Output:
Month Day Year Open High Low Close Week week_high
0 1 1 2003 46.593 46.6560 46.405 46.468 1.0 47.000
1 1 2 2003 46.538 46.6600 46.470 46.673 1.0 47.000
2 1 3 2003 46.717 46.7810 46.530 46.750 1.0 47.000
3 1 4 2003 46.815 46.8430 46.680 46.750 1.0 47.000
4 1 5 2003 46.935 47.0000 46.560 46.593 1.0 47.000
7257 10 26 2022 381.619 387.5799 381.350 382.019 43.0 389.519
7258 10 27 2022 383.070 385.0000 379.329 379.980 43.0 389.519
7259 10 28 2022 379.869 389.5190 379.670 389.019 43.0 389.519
7260 10 31 2022 386.440 388.3990 385.260 386.209 44.0 390.390
7261 11 1 2022 390.140 390.3900 383.290 384.519 44.0 390.390

I am assuming you are using pandas. Other libraries will work similarly.
Create a new DataFrame aggregated per year and week using groupby, then join it back to your original DataFrame:
# select columns with a list (double brackets) and group by Year and Week together
df_grouped = (df[["Year", "Week", "High"]]
              .groupby(["Year", "Week"]).max()
              .rename(columns={"High": "Highest High"}))
df_result = df.join(df_grouped, on=["Year", "Week"])
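
To make the grouping concrete, here is a small runnable sketch that uses only the rows shown in the question and groups on the existing Year and Week columns. The Open, Low and Close columns are left out for brevity; the week_high column name follows the first answer.

```python
import pandas as pd

# The ten visible rows from the question (High values copied as shown).
df = pd.DataFrame({
    "Month": [1, 1, 1, 1, 1, 10, 10, 10, 10, 11],
    "Day": [1, 2, 3, 4, 5, 26, 27, 28, 31, 1],
    "Year": [2003] * 5 + [2022] * 5,
    "High": [46.656, 46.66, 46.781, 46.843, 47.0,
             387.5799, 385.0, 389.519, 388.399, 390.39],
    "Week": [1, 1, 1, 1, 1, 43, 43, 43, 44, 44],
})

# transform("max") returns one value per original row, so it can be
# assigned straight back without a separate join.
df["week_high"] = df.groupby(["Year", "Week"])["High"].transform("max")
print(df)
```

Grouping on both columns matters because week numbers repeat every year; grouping on Week alone would mix Week 1 of 2003 with Week 1 of every other year.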

How to remove certain string from column in pandas dataframe

I want to remove a certain keyword or string from a column of a pandas dataframe.
The dataframe df looks like this:
YEAR WEEK
2019 WK-01
2019 WK-02
2019 WK-03
2019 WK-14
2019 WK-25
2020 WK-06
2020 WK-07
I would like to remove WK- and the leading 0 from the WEEK column so that my output looks like this:
YEAR WEEK
2019 1
2019 2
2019 3
2019 14
2019 25
2020 6
2020 7

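You can try:
# raw string avoids an invalid-escape warning; \d+ requires at least one trailing digit
df['WEEK'] = df['WEEK'].str.extract(r'(\d+)$', expand=False).astype(int)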
Output:
YEAR WEEK
0 2019 1
1 2019 2
2 2019 3
3 2019 14
4 2019 25
5 2020 6
6 2020 7

Shave off the first three characters and convert to int.
df['WEEK'] = df['WEEK'].str[3:].astype(int)
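
The two approaches can be compared side by side on the question's data. This is a small sketch; the variable names by_regex and by_slice are made up for the illustration, and the slicing variant assumes every value starts with exactly the three characters WK-.

```python
import pandas as pd

df = pd.DataFrame({
    "YEAR": [2019, 2019, 2019, 2019, 2019, 2020, 2020],
    "WEEK": ["WK-01", "WK-02", "WK-03", "WK-14", "WK-25", "WK-06", "WK-07"],
})

# Regex: capture the trailing run of digits, whatever precedes it.
by_regex = df["WEEK"].str.extract(r"(\d+)$", expand=False).astype(int)

# Slicing: drop the fixed three-character "WK-" prefix.
by_slice = df["WEEK"].str[3:].astype(int)

# Converting to int is what removes the leading zero ("01" -> 1).
df["WEEK"] = by_regex
print(df)
```

The regex version is more tolerant of prefixes of varying length, while slicing is simpler when the format is fixed.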

Information matrix from pandas dataframe

I have a pandas dataframe like the following:
Customer Id year
0 1510220024 2017
1 1510270013 2017
2 1511160047 2017
3 1512100014 2017
4 1603180006 2017
5 1605030030 2017
6 1605160013 2017
7 1606060008 2017
8 1510220024 2018
9 1606270014 2017
10 1608080011 2017
11 1608090002 2017
12 1511160047 2018
13 1606270014 2018
And I want to build the following matrix from the above dataframe:
2017 2018
2017 11 3
2018 3 3
This matrix shows that there were 11 customers in total in 2017, three of whom also appeared in 2018, and so on. In reality I have 7 years of data, so it would be a 7x7 matrix. I have been struggling with this for a while but can't get it right.

Use a self-merge followed by crosstab:
# pair every year a customer appears in with every year of the same customer
m = df.merge(df, on='Customer Id')
pd.crosstab(m.year_x, m.year_y)
year_y 2017 2018
year_x
2017 11 3
2018 3 3
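
A self-contained version with the question's data is sketched below. The drop_duplicates step is an added assumption, not part of the answer above: it guards against a customer being listed more than once in the same year, which would otherwise inflate the counts.

```python
import pandas as pd

# Data copied from the question.
df = pd.DataFrame({
    "Customer Id": [1510220024, 1510270013, 1511160047, 1512100014,
                    1603180006, 1605030030, 1605160013, 1606060008,
                    1510220024, 1606270014, 1608080011, 1608090002,
                    1511160047, 1606270014],
    "year": [2017] * 8 + [2018, 2017, 2017, 2017, 2018, 2018],
})

# Keep one row per (customer, year) so each customer counts once per year.
pairs = df.drop_duplicates()

# The self-merge pairs each year of a customer with every year of that customer,
# including itself, which is what fills the diagonal.
m = pairs.merge(pairs, on="Customer Id")
matrix = pd.crosstab(m["year_x"], m["year_y"])
print(matrix)
```

The diagonal holds the number of distinct customers per year, and each off-diagonal cell holds the number of customers present in both years, so the matrix is symmetric.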

Python Pandas - Sort Values by keeping a specific order

I am trying to sort values by keeping the index in a specific order.
from random import randint
import pandas as pd
days = ["Tuesday", "Thursday", "Monday", "Wednesday"]
a = pd.DataFrame({"Value": [randint(0, 9) for i in range(len(days)*5)],
                  "Year": [y for i in range(len(days)) for y in range(2014,2019)]},
                 index=[day for day in days for i in range(5)])
myorder = ["Monday", "Tuesday", "Wednesday", "Thursday"]
a.index = pd.CategoricalIndex(a.index, categories=myorder, ordered=True)
a = a.sort_index()
Applying a.sort_index() gives me my specific order. However, the values of Year are then in no particular order within each day. If we naively call a.sort_values(["Year"]), the index order is lost again. How can I sort the Year values while keeping my index order?

You need to create a column from the index and sort both together:
a = a.reset_index().sort_values(['index','Year']).set_index('index').rename_axis(None)
Or create a MultiIndex from the column and sort both together:
a = (a.set_index('Year', append=True)
      .sort_index()
      .reset_index(level=1)
      .reindex(columns=a.columns))
print (a)
Value Year
Monday 7 2014
Monday 3 2015
Monday 2 2016
Monday 5 2017
Monday 4 2018
Tuesday 6 2014
Tuesday 0 2015
Tuesday 0 2016
Tuesday 9 2017
Tuesday 2 2018
Wednesday 6 2014
Wednesday 7 2015
Wednesday 5 2016
Wednesday 5 2017
Wednesday 5 2018
Thursday 3 2014
Thursday 2 2015
Thursday 8 2016
Thursday 7 2017
Thursday 7 2018

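Non-categorical approach, sorting by the customized index order and Year simultaneously:
# map each day name to its position in myorder
orderdic = dict(zip(myorder, range(len(myorder))))
# drop(columns=...) replaces the positional axis argument, which newer pandas no longer accepts
a = a.assign(order=a.index.to_series().map(orderdic))\
     .sort_values(['order', 'Year']).drop(columns='order')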
# Value Year
# Monday 2 2014
# Monday 4 2015
# Monday 8 2016
# Monday 8 2017
# Monday 7 2018
# Tuesday 5 2014
# Tuesday 4 2015
# Tuesday 0 2016
# Tuesday 1 2017
# Tuesday 3 2018
# Wednesday 2 2014
# Wednesday 8 2015
# Wednesday 4 2016
# Wednesday 3 2017
# Wednesday 4 2018
# Thursday 7 2014
# Thursday 4 2015
# Thursday 7 2016
# Thursday 2 2017
# Thursday 1 2018
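
Because the question's frame is built from random values, the outputs above differ between runs. The sketch below repeats the categorical approach on a deterministic stand-in so the result can be checked; the Value numbers, the shuffled Year pattern and the temporary column name day are made up for the illustration.

```python
import pandas as pd

days = ["Tuesday", "Thursday", "Monday", "Wednesday"]
myorder = ["Monday", "Tuesday", "Wednesday", "Thursday"]

# Deterministic stand-in for the question's random frame; years are deliberately shuffled.
a = pd.DataFrame(
    {"Value": list(range(20)), "Year": [2016, 2014, 2018, 2015, 2017] * 4},
    index=[day for day in days for _ in range(5)],
)

# An ordered categorical index makes sorting follow myorder instead of the alphabet.
a.index = pd.CategoricalIndex(a.index, categories=myorder, ordered=True)

# Sort by day (category order) first, then by Year within each day.
result = (
    a.rename_axis("day")
     .reset_index()
     .sort_values(["day", "Year"])
     .set_index("day")
     .rename_axis(None)
)
print(result)
```

Naming the index before reset_index avoids relying on the default column name index, which is what the one-liner in the first answer depends on.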
