Sorting a column by date - python

I am trying to sort the 'Months' column of my 'Mon18' table so that it runs from January through December, each month with its corresponding count. When I try to sort the column, it is ordered either by highest count or alphabetically by month name. See the example below:
print (df)
Months Count
10 April 2018 684
3 August 2018 1098
1 December 2018 1207
11 February 2018 642
8 January 2018 847
5 July 2018 1040
6 June 2018 986
9 March 2018 809
7 May 2018 854
0 November 2018 1215
2 October 2018 1128
4 September 2018 1062

The idea is to convert the column to datetimes and pass the positions returned by Series.argsort to DataFrame.iloc:
df = df.iloc[pd.to_datetime(df['Months'], format='%B %Y').argsort()]
print (df)
Months Count
8 January 2018 847
11 February 2018 642
9 March 2018 809
10 April 2018 684
7 May 2018 854
6 June 2018 986
5 July 2018 1040
3 August 2018 1098
4 September 2018 1062
2 October 2018 1128
0 November 2018 1215
1 December 2018 1207
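As a variation on the same idea, here is a minimal sketch that parses the dates inside the key argument of sort_values instead of using argsort. It assumes pandas 1.1 or later (where key is available) and uses a few rows from the question as sample data.

```python
import pandas as pd

# Sample data: a subset of the rows shown in the question.
df = pd.DataFrame({
    "Months": ["April 2018", "August 2018", "December 2018",
               "February 2018", "January 2018"],
    "Count": [684, 1098, 1207, 642, 847],
})

# The key function receives the 'Months' column as a Series and returns
# the parsed datetimes, which are used only for ordering the rows.
result = df.sort_values(
    "Months",
    key=lambda s: pd.to_datetime(s, format="%B %Y"),
)
print(result)
```

The 'Months' column itself stays a string column; only the row order changes.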

Related

Pandas: fill out missing months in dataframe

My dataframe contains zipcodes, months and the number of purchases up until that month.
However, some months are missing for some zipcodes. As you can see in the example below, the months March and April are not recorded for zipcode '2400'.
Zipcode Date Cumulative purchases
0 9999 December 2018 2
1 9999 January 2019 2
2 9999 February 2019 2
3 9999 March 2019 3
4 9999 April 2019 4
5 2400 December 2018 2
6 2400 January 2019 3
7 2400 February 2019 4
etc
I would like to add records for these missing months, repeating the last known cumulative purchases value.
Ideally it would look like this:
Zipcode Date Cumulative purchases
0 9999 December 2018 2
1 9999 January 2019 2
2 9999 February 2019 2
3 9999 March 2019 3
4 9999 April 2019 4
5 2400 December 2018 2
6 2400 January 2019 3
7 2400 February 2019 4
8 2400 March 2019 4
9 2400 April 2019 4
etc
You could use the complete function from pyjanitor to expose the missing rows, then forward-fill them:
# pip install pyjanitor
import pandas as pd
import janitor as jn
df.complete('Zipcode', 'Date').ffill()
Zipcode Date Cumulative purchases
0 9999 December 2018 2.0
1 9999 January 2019 2.0
2 9999 February 2019 2.0
3 9999 March 2019 3.0
4 9999 April 2019 4.0
5 2400 December 2018 2.0
6 2400 January 2019 3.0
7 2400 February 2019 4.0
8 2400 March 2019 4.0
9 2400 April 2019 4.0
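Here is a slightly modified version of the previous answer (it starts from the raw data with a Purchase column): reset_index is removed, the result is reshaped with Series.unstack, the missing months up to until are added with DataFrame.reindex, the gaps are forward-filled, and the result is reshaped back with DataFrame.stack: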
df['Date'] = pd.to_datetime(df['Date'])
df = (df.set_index('Date')
        .groupby('Zipcode', sort=False)
        .resample('MS')['Purchase'].sum()
        .groupby(level=0)
        .cumsum()
        .unstack()
      )
until = pd.to_datetime('2019-04')
df = (df.reindex(pd.date_range(df.columns.min(), until, freq='MS', name='Date'), axis=1)
        .ffill(axis=1)
        .stack()
        .astype(int)
        .reset_index(name='Cumulative purchases'))
df['Date'] = df['Date'].dt.strftime('%B %Y')
print (df)
Zipcode Date Cumulative purchases
0 9999 December 2018 2
1 9999 January 2019 2
2 9999 February 2019 2
3 9999 March 2019 3
4 9999 April 2019 4
5 2400 December 2018 2
6 2400 January 2019 3
7 2400 February 2019 4
8 2400 March 2019 4
9 2400 April 2019 4
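If you only have the cumulative table from the question (no raw Purchase column), a pandas-only sketch could look like the following. It is one possible approach, not the code from the answers above: it builds every Zipcode/month combination with MultiIndex.from_product, reindexes, and forward-fills within each zipcode. The sample frame is typed in from the question, and the sketch assumes every zipcode has a value for the first month.

```python
import pandas as pd

# Sample data typed in from the question.
df = pd.DataFrame({
    "Zipcode": [9999] * 5 + [2400] * 3,
    "Date": ["December 2018", "January 2019", "February 2019",
             "March 2019", "April 2019",
             "December 2018", "January 2019", "February 2019"],
    "Cumulative purchases": [2, 2, 2, 3, 4, 2, 3, 4],
})

# Parse the month labels so they can be ranged chronologically.
df["Date"] = pd.to_datetime(df["Date"], format="%B %Y")

# Every zipcode paired with every month between the first and last date seen.
months = pd.date_range(df["Date"].min(), df["Date"].max(), freq="MS")
zipcodes = df["Zipcode"].unique()  # keeps order of first appearance
full_index = pd.MultiIndex.from_product(
    [zipcodes, months], names=["Zipcode", "Date"]
)

filled = (
    df.set_index(["Zipcode", "Date"])
      .reindex(full_index)                   # missing months become NaN rows
      .groupby(level="Zipcode", sort=False)
      .ffill()                               # carry the last total forward per zipcode
      .astype(int)                           # would fail if a zipcode lacked its first month
      .reset_index()
)
filled["Date"] = filled["Date"].dt.strftime("%B %Y")
print(filled)
```

Forward-filling inside the groupby keeps values from one zipcode from leaking into the next.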

Sorting a Data Frame by Month with Repeating Years, based on Unique 'Other' Column

In pandas, I am trying to sort rows of a large data frame by months. At the moment, the months are out of order. They are sorted alphabetically, but I would like to sort them chronologically.
The tricky part is that I am sorting by a cycle of 21 months for every one product. There are two year columns, one for calendar year and one for fiscal year, and they differ on purpose. Fiscal Year 2021 is January 2021 - September 2021, and Fiscal Year 2022 is October 2021 - September 2022. There are hundreds of products, and the section below is just a sample of two products.
As seen in the table below, the months are out of order, but everything else is in the right order.
Again, every product has 21 months, from January 2021 to September 2022. I want these to appear in chronological order for every product.
I am looking for code that sorts this data frame the right way.
How it looks now (months not chronological by year):
Item       Calendar Year  Fiscal Year  Month      Amount
Product 1  2021           2021         April      45
Product 1  2021           2021         August     85
Product 1  2021           2021         February   25
Product 1  2021           2021         January    15
Product 1  2021           2021         July       75
Product 1  2021           2021         June       65
Product 1  2021           2021         March      35
Product 1  2021           2021         May        55
Product 1  2021           2021         September  95
Product 1  2021           2022         December   125
Product 1  2021           2022         November   115
Product 1  2021           2022         October    105
Product 1  2022           2022         April      405
Product 1  2022           2022         August     805
Product 1  2022           2022         February   205
Product 1  2022           2022         January    1005
Product 1  2022           2022         July       705
Product 1  2022           2022         June       605
Product 1  2022           2022         March      305
Product 1  2022           2022         May        505
Product 1  2022           2022         September  905
Product 2  2021           2021         April      4000
Product 2  2021           2021         August     8000
Product 2  2021           2021         February   2000
Product 2  2021           2021         January    1000
Product 2  2021           2021         July       7000
Product 2  2021           2021         June       6000
Product 2  2021           2021         March      3000
Product 2  2021           2021         May        5000
Product 2  2021           2021         September  9000
Product 2  2021           2022         December   12000
Product 2  2021           2022         November   11000
Product 2  2021           2022         October    10000
Product 2  2022           2022         April      40000
Product 2  2022           2022         August     80000
Product 2  2022           2022         February   20000
Product 2  2022           2022         January    10000
Product 2  2022           2022         July       70000
Product 2  2022           2022         June       60000
Product 2  2022           2022         March      30000
Product 2  2022           2022         May        50000
Product 2  2022           2022         September  90000
How it should look (months in order):
Item       Calendar Year  Fiscal Year  Month      Amount
Product 1  2021           2021         January    15
Product 1  2021           2021         February   25
Product 1  2021           2021         March      35
Product 1  2021           2021         April      45
Product 1  2021           2021         May        55
Product 1  2021           2021         June       65
Product 1  2021           2021         July       75
Product 1  2021           2021         August     85
Product 1  2021           2021         September  95
Product 1  2021           2022         October    105
Product 1  2021           2022         November   115
Product 1  2021           2022         December   125
Product 1  2022           2022         January    1005
Product 1  2022           2022         February   205
Product 1  2022           2022         March      305
Product 1  2022           2022         April      405
Product 1  2022           2022         May        505
Product 1  2022           2022         June       605
Product 1  2022           2022         July       705
Product 1  2022           2022         August     805
Product 1  2022           2022         September  905
Product 2  2021           2021         January    1000
Product 2  2021           2021         February   2000
Product 2  2021           2021         March      3000
Product 2  2021           2021         April      4000
Product 2  2021           2021         May        5000
Product 2  2021           2021         June       6000
Product 2  2021           2021         July       7000
Product 2  2021           2021         August     8000
Product 2  2021           2021         September  9000
Product 2  2021           2022         October    10000
Product 2  2021           2022         November   11000
Product 2  2021           2022         December   12000
Product 2  2022           2022         January    10000
Product 2  2022           2022         February   20000
Product 2  2022           2022         March      30000
Product 2  2022           2022         April      40000
Product 2  2022           2022         May        50000
Product 2  2022           2022         June       60000
Product 2  2022           2022         July       70000
Product 2  2022           2022         August     80000
Product 2  2022           2022         September  90000
First convert the month names to an ordered categorical, so that DataFrame.sort_values can sort by multiple columns in calendar order:
cat = ['January','February','March','April','May','June',
'July','August','September','October','November','December']
df['Month'] = pd.Categorical(df['Month'], ordered=True, categories=cat)
df = df.sort_values(['Item','Calendar Year','Month'])
Or create a DatetimeIndex from the year and month (casting the year to string so the two can be concatenated), then sort by Item and the datetimes:
df.index = pd.to_datetime(df['Calendar Year'].astype(str) + df['Month'], format='%Y%B')
df = df.rename_axis('dt').sort_values(['Item','dt']).reset_index(drop=True)
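To try the categorical approach on a small frame, here is a self-contained sketch. The six sample rows are picked from the question's table; taking the month names from calendar.month_name instead of typing them out is my own shortcut and assumes an English locale.

```python
import calendar

import pandas as pd

# A few rows from the question, deliberately out of order.
df = pd.DataFrame({
    "Item": ["Product 1"] * 4 + ["Product 2"] * 2,
    "Calendar Year": [2021, 2021, 2022, 2021, 2021, 2021],
    "Fiscal Year": [2021, 2022, 2022, 2021, 2021, 2021],
    "Month": ["April", "October", "January", "January", "March", "February"],
    "Amount": [45, 105, 1005, 15, 3000, 2000],
})

# Month names in calendar order; index 0 of month_name is an empty string.
month_order = list(calendar.month_name)[1:]

# An ordered categorical sorts by position in month_order, not alphabetically.
df["Month"] = pd.Categorical(df["Month"], categories=month_order, ordered=True)

df = df.sort_values(["Item", "Calendar Year", "Month"]).reset_index(drop=True)
print(df)
```

Sorting on Calendar Year rather than Fiscal Year is what puts October to December 2021 before January 2022.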

I need help plotting a bar graph from a dataframe

I have the following dataframe:
AQI Year City
0 349.407407 2015 'Patna'
1 297.024658 2015 'Delhi'
2 283.007605 2015 'Ahmedabad'
3 297.619178 2016 'Delhi'
4 282.717949 2016 'Ahmedabad'
5 250.528701 2016 'Patna'
6 379.753623 2017 'Ahmedabad'
7 325.652778 2017 'Patna'
8 281.401216 2017 'Gurugram'
9 443.053221 2018 'Ahmedabad'
10 248.367123 2018 'Delhi'
11 233.772603 2018 'Lucknow'
12 412.781250 2019 'Ahmedabad'
13 230.720548 2019 'Delhi'
14 217.626741 2019 'Patna'
15 214.681818 2020 'Ahmedabad'
16 181.672131 2020 'Delhi'
17 162.251366 2020 'Patna'
I would like to group the data by year (2015, 2016, 2017, 2018, ..., 2020) on the x axis, with AQI on the y axis. I am a newbie, so please excuse the lack of depth in my question.
You can "pivot" your data to support your desired plotting output. Here we set the rows as Year, columns as City, and values as AQI.
pivot = pd.pivot_table(
data=df,
index='Year',
columns='City',
values='AQI',
)
Year  Ahmedabad   Delhi       Gurugram    Lucknow     Patna
2015  283.007605  297.024658  NaN         NaN         349.407407
2016  282.717949  297.619178  NaN         NaN         250.528701
2017  379.753623  NaN         281.401216  NaN         325.652778
2018  443.053221  248.367123  NaN         233.772603  NaN
2019  412.781250  230.720548  NaN         NaN         217.626741
2020  214.681818  181.672131  NaN         NaN         162.251366
Then you can plot this pivot table directly:
pivot.plot.bar(xlabel='Year', ylabel='AQI')
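For reference, a minimal sketch of the pivot step on its own, using the 2015 and 2017 rows from the question. The plotting call is left out so the snippet runs without matplotlib. Note that pivot_table aggregates with the mean by default; since each Year/City pair appears once here, the values pass through unchanged.

```python
import pandas as pd

# The 2015 and 2017 rows from the question.
df = pd.DataFrame({
    "AQI": [349.407407, 297.024658, 283.007605,
            379.753623, 325.652778, 281.401216],
    "Year": [2015, 2015, 2015, 2017, 2017, 2017],
    "City": ["Patna", "Delhi", "Ahmedabad",
             "Ahmedabad", "Patna", "Gurugram"],
})

# One row per year, one column per city; cities absent in a year become NaN.
pivot = pd.pivot_table(data=df, index="Year", columns="City", values="AQI")
print(pivot)
```

Each column of the pivot becomes one bar series in the grouped bar chart, and NaN cells simply produce no bar for that city in that year.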
Old answer
Are you looking for the mean AQI per year? If so, you can do some pandas chaining, assuming your data is in a DataFrame df:
df.groupby('Year')['AQI'].mean().plot.bar(xlabel='Year', ylabel='AQI')

How to sort pandas dataframe by two date columns

I have a pandas dataframe like this:
column_year column_Month a_integer_column
0 2014 April 25.326531
1 2014 August 25.544554
2 2015 December 25.678261
3 2014 February 24.801187
4 2014 July 24.990338
... ... ... ...
68 2018 November 26.024931
69 2017 October 25.677333
70 2019 September 24.432361
71 2020 February 25.383648
72 2020 January 25.504831
I now want to sort by the year column first and then by the month column in calendar order, like this:
column_year column_Month a_integer_column
3 2014 February 24.801187
0 2014 April 25.326531
4 2014 July 24.990338
1 2014 August 25.544554
2 2015 December 25.678261
... ... ... ...
69 2017 October 25.677333
68 2018 November 26.024931
70 2019 September 24.432361
72 2020 January 25.504831
71 2020 February 25.383648
How do I do this?
Let us try to_datetime + argsort:
df = df.iloc[pd.to_datetime(df.column_year.astype(str) + df.column_Month, format='%Y%B').argsort()]
column_year column_Month a_integer_column
3 2014 February 24.801187
0 2014 April 25.326531
4 2014 July 24.990338
1 2014 August 25.544554
2 2015 December 25.678261
You can convert the column_Month column to an ordered CategoricalDtype and then sort on both columns:
Months = pd.CategoricalDtype([
'January', 'February', 'March', 'April', 'May', 'June',
'July', 'August', 'September', 'October', 'November', 'December'
], ordered=True)
df.astype({'column_Month': Months}).sort_values(['column_year', 'column_Month'])
column_year column_Month a_integer_column
3 2014 February 24.801187
0 2014 April 25.326531
4 2014 July 24.990338
1 2014 August 25.544554
2 2015 December 25.678261
69 2017 October 25.677333
68 2018 November 26.024931
70 2019 September 24.432361
72 2020 January 25.504831
71 2020 February 25.383648
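If column_Month is already an ordered categorical, a plain sort_values on both columns is enough; on plain strings it orders the months alphabetically:
df = df.sort_values(by=["column_year", "column_Month"], ascending=[True, True])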
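The difference between sorting month names as text and as an ordered categorical is easy to check on the first five rows of the question. This is only an illustrative sketch; the variable names as_text and as_calendar are mine.

```python
import pandas as pd

# First five rows from the question.
df = pd.DataFrame({
    "column_year": [2014, 2014, 2015, 2014, 2014],
    "column_Month": ["April", "August", "December", "February", "July"],
    "a_integer_column": [25.326531, 25.544554, 25.678261, 24.801187, 24.990338],
})

# Plain strings: within each year the months come out alphabetically.
as_text = df.sort_values(["column_year", "column_Month"])

# Ordered categorical: within each year the months follow the calendar.
month_type = pd.CategoricalDtype([
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
], ordered=True)
as_calendar = (
    df.astype({"column_Month": month_type})
      .sort_values(["column_year", "column_Month"])
)

print(as_text["column_Month"].tolist())
print(as_calendar["column_Month"].tolist())
```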

Information matrix from pandas dataframe

I have a pandas dataframe like the following:
Customer Id year
0 1510220024 2017
1 1510270013 2017
2 1511160047 2017
3 1512100014 2017
4 1603180006 2017
5 1605030030 2017
6 1605160013 2017
7 1606060008 2017
8 1510220024 2018
9 1606270014 2017
10 1608080011 2017
11 1608090002 2017
12 1511160047 2018
13 1606270014 2018
And I want to build the following matrix from the above dataframe:
2017 2018
2017 11 3
2018 3 3
This matrix says that there were 11 customers in total in 2017, that three of them also appeared in 2018, and so on. My actual data covers 7 years, so it would be a 7x7 matrix. I have been struggling with this for a while but can't get it right.
merge + crosstab:
m = df.merge(df, left_on='Customer Id', right_on='Customer Id')
pd.crosstab(m.year_x, m.year_y)
year_y 2017 2018
year_x
2017 11 3
2018 3 3
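Here is the same approach as a runnable sketch with the data from the question typed in. The drop_duplicates call is my own addition: it guards against a customer being listed twice in the same year, which would otherwise inflate the counts.

```python
import pandas as pd

# Data typed in from the question.
df = pd.DataFrame({
    "Customer Id": [
        1510220024, 1510270013, 1511160047, 1512100014, 1603180006,
        1605030030, 1605160013, 1606060008, 1510220024, 1606270014,
        1608080011, 1608090002, 1511160047, 1606270014,
    ],
    "year": [2017] * 8 + [2018] + [2017] * 3 + [2018] * 2,
})

# Keep one row per (customer, year) pair.
unique_pairs = df.drop_duplicates()

# Self-merge: each customer contributes one row per pair of years they appear in.
m = unique_pairs.merge(unique_pairs, on="Customer Id")

# Count customers for every (year_x, year_y) combination.
matrix = pd.crosstab(m["year_x"], m["year_y"])
print(matrix)
```

The diagonal holds the number of customers per year, and the off-diagonal cells hold the number of customers shared between two years.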
