I have a pandas dataframe like this:
column_year column_Month a_integer_column
0 2014 April 25.326531
1 2014 August 25.544554
2 2015 December 25.678261
3 2014 February 24.801187
4 2014 July 24.990338
... ... ... ...
68 2018 November 26.024931
69 2017 October 25.677333
70 2019 September 24.432361
71 2020 February 25.383648
72 2020 January 25.504831
I now want to sort by the year column first and then by the month column in calendar order, like this:
column_year column_Month a_integer_column
3 2014 February 24.801187
0 2014 April 25.326531
4 2014 July 24.990338
1 2014 August 25.544554
2 2015 December 25.678261
... ... ... ...
69 2017 October 25.677333
68 2018 November 26.024931
70 2019 September 24.432361
72 2020 January 25.504831
71 2020 February 25.383648
How do I do this?
Let us try to_datetime + argsort:
df=df.iloc[pd.to_datetime(df.column_year.astype(str)+df.column_Month,format='%Y%B').argsort()]
column_year column_Month a_integer_column
3 2014 February 24.801187
0 2014 April 25.326531
4 2014 July 24.990338
1 2014 August 25.544554
2 2015 December 25.678261
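For reference, here is a minimal, self-contained sketch of the same to_datetime + argsort idea. The five-row frame is only a made-up sample shaped like the one in the question, and the names dates and ordered are illustrative.

```python
import pandas as pd

# Hypothetical sample mirroring the question's columns.
df = pd.DataFrame({
    "column_year": [2014, 2014, 2015, 2014, 2014],
    "column_Month": ["April", "August", "December", "February", "July"],
    "a_integer_column": [25.326531, 25.544554, 25.678261, 24.801187, 24.990338],
})

# Build one datetime per row from strings such as "2014April".
dates = pd.to_datetime(
    df["column_year"].astype(str) + df["column_Month"], format="%Y%B"
)

# argsort returns the row positions that put the dates in ascending order;
# iloc then reorders the frame by those positions.
ordered = df.iloc[dates.argsort().to_numpy()]
print(ordered)
```

The original index labels are kept, so you can still see where each row came from. Note that %B expects English month names under the default locale.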
You can convert the column_Month column to an ordered CategoricalDtype, so that sorting follows calendar order:
Months = pd.CategoricalDtype([
'January', 'February', 'March', 'April', 'May', 'June',
'July', 'August', 'September', 'October', 'November', 'December'
], ordered=True)
df.astype({'column_Month': Months}).sort_values(['column_year', 'column_Month'])
column_year column_Month a_integer_column
3 2014 February 24.801187
0 2014 April 25.326531
4 2014 July 24.990338
1 2014 August 25.544554
2 2015 December 25.678261
69 2017 October 25.677333
68 2018 November 26.024931
70 2019 September 24.432361
72 2020 January 25.504831
71 2020 February 25.383648
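If column_Month is already an ordered categorical, a plain sort_values on both columns is enough (on plain strings it sorts the months alphabetically):
df = df.sort_values(by=["column_year", "column_Month"], ascending=[True, True])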
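To make the difference concrete, here is a small sketch comparing a plain string sort with an ordered-categorical sort. The four-row frame is invented for illustration, and building the month list from calendar.month_name is my own shortcut, which assumes an English (default) locale.

```python
import calendar
import pandas as pd

# Hypothetical sample; only the two sort columns matter here.
df = pd.DataFrame({
    "column_year": [2020, 2014, 2020, 2014],
    "column_Month": ["February", "July", "January", "April"],
})

# calendar.month_name[0] is an empty string, so drop the first entry.
month_dtype = pd.CategoricalDtype(list(calendar.month_name)[1:], ordered=True)

# Plain strings: months are compared alphabetically within each year.
alphabetical = df.sort_values(["column_year", "column_Month"])

# Ordered categorical: months are compared in calendar order.
chronological = (
    df.astype({"column_Month": month_dtype})
      .sort_values(["column_year", "column_Month"])
)

print(alphabetical)
print(chronological)
```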
My dataframe contains zipcodes, months and the number of purchases up until that month.
However, some months are missing for some zipcodes. As you can see in the example below, the months March and April are not recorded for zipcode '2400'.
Zipcode Date Cumulative purchases
0 9999 December 2018 2
1 9999 January 2019 2
2 9999 February 2019 2
3 9999 March 2019 3
4 9999 April 2019 4
5 2400 December 2018 2
6 2400 January 2019 3
7 2400 February 2019 4
etc
I would like to add the missing month records by repeating the last known cumulative purchases value.
Ideally it would look like this:
Zipcode Date Cumulative purchases
0 9999 December 2018 2
1 9999 January 2019 2
2 9999 February 2019 2
3 9999 March 2019 3
4 9999 April 2019 4
5 2400 December 2018 2
6 2400 January 2019 3
7 2400 February 2019 4
8 2400 March 2019 4
9 2400 April 2019 4
etc
You could use the complete function from pyjanitor to expose the missing values:
# pip install pyjanitor
import pandas as pd
import janitor as jn
df.complete('Zipcode', ('Date', 'Cumulative')).ffill()
Zipcode Date Cumulative purchases
0 9999 December 2018 2.0
1 9999 January 2019 2.0
2 9999 February 2019 2.0
3 9999 March 2019 3.0
4 9999 April 2019 4.0
5 2400 December 2018 2.0
6 2400 January 2019 3.0
7 2400 February 2019 4.0
8 2400 March 2019 4.0
9 2400 April 2019 4.0
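If you would rather not add a dependency, roughly the same result can be sketched in plain pandas by reindexing on every (Zipcode, Date) pair and forward filling within each zipcode. This is only a sketch: the sample frame is typed in from the question, and it assumes that the dates appear in chronological order in the data and that each zipcode has a value for the first month.

```python
import pandas as pd

# Sample data typed in from the question.
df = pd.DataFrame({
    "Zipcode": ["9999"] * 5 + ["2400"] * 3,
    "Date": ["December 2018", "January 2019", "February 2019", "March 2019",
             "April 2019", "December 2018", "January 2019", "February 2019"],
    "Cumulative purchases": [2, 2, 2, 3, 4, 2, 3, 4],
})

# Every (Zipcode, Date) combination, in order of first appearance.
full_index = pd.MultiIndex.from_product(
    [df["Zipcode"].unique(), df["Date"].unique()], names=["Zipcode", "Date"]
)

completed = (
    df.set_index(["Zipcode", "Date"])
      .reindex(full_index)                   # missing pairs become NaN rows
      .groupby(level="Zipcode", sort=False)
      .ffill()                               # fill only within each zipcode
      .astype(int)
      .reset_index()
)
print(completed)
```

Grouping before the forward fill keeps values from leaking from one zipcode into the next.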
Here is a slightly modified version of the previous answer: reset_index is removed, the result is reshaped with Series.unstack, the missing months up to until are added with DataFrame.reindex, the missing values are forward filled, and the result is reshaped back with DataFrame.stack:
df['Date'] = pd.to_datetime(df['Date'])
df = (df.set_index('Date')
.groupby('Zipcode', sort=False)
.resample('MS')['Purchase'].sum()
.groupby(level=0)
.cumsum()
.unstack()
)
until = pd.to_datetime('2019-04')
df = (df.reindex(pd.date_range(df.columns.min(), until, freq='MS', name='Date'), axis=1)
.ffill(axis=1)
.stack()
.astype(int)
.reset_index(name='Cumulative purchases'))
df['Date'] = df['Date'].dt.strftime('%B %Y')
print (df)
Zipcode Date Cumulative purchases
0 9999 December 2018 2
1 9999 January 2019 2
2 9999 February 2019 2
3 9999 March 2019 3
4 9999 April 2019 4
5 2400 December 2018 2
6 2400 January 2019 3
7 2400 February 2019 4
8 2400 March 2019 4
9 2400 April 2019 4
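The code above starts from a raw Purchase column. If you only have the cumulative table shown in the question, a variant of the same reindex + forward-fill idea might look like the sketch below. Here I put the dates on the rows instead of the columns, so the final reshape is DataFrame.unstack rather than stack. The sample frame is typed in from the question and the names wide, months and filled are illustrative.

```python
import pandas as pd

# Sample data typed in from the question.
df = pd.DataFrame({
    "Zipcode": ["9999"] * 5 + ["2400"] * 3,
    "Date": ["December 2018", "January 2019", "February 2019", "March 2019",
             "April 2019", "December 2018", "January 2019", "February 2019"],
    "Cumulative purchases": [2, 2, 2, 3, 4, 2, 3, 4],
})
df["Date"] = pd.to_datetime(df["Date"], format="%B %Y")

until = pd.Timestamp("2019-04-01")

# One row per month, one column per zipcode.
wide = df.pivot(index="Date", columns="Zipcode", values="Cumulative purchases")

# Every month start from the first recorded month up to `until`.
months = pd.date_range(wide.index.min(), until, freq="MS", name="Date")

filled = (
    wide.reindex(months)   # add rows for months that are missing
        .ffill()           # repeat the last known cumulative value
        .unstack()         # back to long form, indexed by (Zipcode, Date)
        .astype(int)
        .reset_index(name="Cumulative purchases")
)
filled["Date"] = filled["Date"].dt.strftime("%B %Y")
print(filled)
```

Because pivot sorts the zipcodes, 2400 comes before 9999 in this output.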
In pandas, I am trying to sort rows of a large data frame by months. At the moment, the months are out of order. They are sorted alphabetically, but I would like to sort them chronologically.
The tricky part is that I am sorting by a cycle of 21 months for every one product. There are two year columns, one for calendar year and one for fiscal year, and they differ on purpose. Fiscal Year 2021 is January 2021 - September 2021, and Fiscal Year 2022 is October 2021 - September 2022. There are hundreds of products, and the section below is just a sample of two products.
As seen in the table below, the months are out of order, but everything else is in the right order.
Again, every product has 21 months, from January 2021 to September 2022. I want these to appear in chronological order for every product.
I am looking for code to sort this data frame in the right way.
How it looks now (months not chronological by year):
Item       Calendar Year  Fiscal Year  Month      Amount
Product 1  2021           2021         April      45
Product 1  2021           2021         August     85
Product 1  2021           2021         February   25
Product 1  2021           2021         January    15
Product 1  2021           2021         July       75
Product 1  2021           2021         June       65
Product 1  2021           2021         March      35
Product 1  2021           2021         May        55
Product 1  2021           2021         September  95
Product 1  2021           2022         December   125
Product 1  2021           2022         November   115
Product 1  2021           2022         October    105
Product 1  2022           2022         April      405
Product 1  2022           2022         August     805
Product 1  2022           2022         February   205
Product 1  2022           2022         January    1005
Product 1  2022           2022         July       705
Product 1  2022           2022         June       605
Product 1  2022           2022         March      305
Product 1  2022           2022         May        505
Product 1  2022           2022         September  905
Product 2  2021           2021         April      4000
Product 2  2021           2021         August     8000
Product 2  2021           2021         February   2000
Product 2  2021           2021         January    1000
Product 2  2021           2021         July       7000
Product 2  2021           2021         June       6000
Product 2  2021           2021         March      3000
Product 2  2021           2021         May        5000
Product 2  2021           2021         September  9000
Product 2  2021           2022         December   12000
Product 2  2021           2022         November   11000
Product 2  2021           2022         October    10000
Product 2  2022           2022         April      40000
Product 2  2022           2022         August     80000
Product 2  2022           2022         February   20000
Product 2  2022           2022         January    10000
Product 2  2022           2022         July       70000
Product 2  2022           2022         June       60000
Product 2  2022           2022         March      30000
Product 2  2022           2022         May        50000
Product 2  2022           2022         September  90000
How it should look (months in order):
Item       Calendar Year  Fiscal Year  Month      Amount
Product 1  2021           2021         January    15
Product 1  2021           2021         February   25
Product 1  2021           2021         March      35
Product 1  2021           2021         April      45
Product 1  2021           2021         May        55
Product 1  2021           2021         June       65
Product 1  2021           2021         July       75
Product 1  2021           2021         August     85
Product 1  2021           2021         September  95
Product 1  2021           2022         October    105
Product 1  2021           2022         November   115
Product 1  2021           2022         December   125
Product 1  2022           2022         January    1005
Product 1  2022           2022         February   205
Product 1  2022           2022         March      305
Product 1  2022           2022         April      405
Product 1  2022           2022         May        505
Product 1  2022           2022         June       605
Product 1  2022           2022         July       705
Product 1  2022           2022         August     805
Product 1  2022           2022         September  905
Product 2  2021           2021         January    1000
Product 2  2021           2021         February   2000
Product 2  2021           2021         March      3000
Product 2  2021           2021         April      4000
Product 2  2021           2021         May        5000
Product 2  2021           2021         June       6000
Product 2  2021           2021         July       7000
Product 2  2021           2021         August     8000
Product 2  2021           2021         September  9000
Product 2  2021           2022         October    10000
Product 2  2021           2022         November   11000
Product 2  2021           2022         December   12000
Product 2  2022           2022         January    10000
Product 2  2022           2022         February   20000
Product 2  2022           2022         March      30000
Product 2  2022           2022         April      40000
Product 2  2022           2022         May        50000
Product 2  2022           2022         June       60000
Product 2  2022           2022         July       70000
Product 2  2022           2022         August     80000
Product 2  2022           2022         September  90000
First convert the Month values to an ordered categorical, which makes it possible to sort by multiple columns with DataFrame.sort_values:
cat = ['January','February','March','April','May','June',
'July','August','September','October','November','December']
df['Month'] = pd.Categorical(df['Month'], ordered=True, categories=cat)
df = df.sort_values(['Item','Calendar Year','Month'])
Or create a DatetimeIndex, so you can sort by Item together with the datetimes (astype(str) is needed if Calendar Year is stored as integers):
df.index = pd.to_datetime(df['Calendar Year'].astype(str) + df['Month'], format='%Y%B')
df = df.rename_axis('dt').sort_values(['Item','dt']).reset_index(drop=True)
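Here is a small runnable sketch of the datetime variant on a handful of rows, using a temporary helper column instead of replacing the index. The six rows are picked from the question's table, and the helper column name _dt is just a placeholder of mine.

```python
import pandas as pd

# A few rows picked from the question's table, deliberately out of order.
df = pd.DataFrame({
    "Item": ["Product 1"] * 4 + ["Product 2"] * 2,
    "Calendar Year": [2021, 2021, 2022, 2021, 2022, 2021],
    "Fiscal Year": [2021, 2022, 2022, 2022, 2022, 2021],
    "Month": ["April", "December", "January", "October", "February", "January"],
    "Amount": [45, 125, 1005, 105, 20000, 1000],
})

# Calendar year + month name identify the month uniquely, so the fiscal year
# is not needed for ordering.
sort_key = pd.to_datetime(
    df["Calendar Year"].astype(str) + df["Month"], format="%Y%B"
)

result = (
    df.assign(_dt=sort_key)          # temporary helper column
      .sort_values(["Item", "_dt"])
      .drop(columns="_dt")
      .reset_index(drop=True)
)
print(result)
```

Within each product the rows then run from the earliest calendar month to the latest, which also keeps October to December 2021 ahead of January 2022 inside fiscal year 2022.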
(?:\d{1,2}[\-\/])?(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December)?[\,\.\s]*(?:\d{1,2}[\-\/\.)\s,]*)+(?:\d{2,4})(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec|January|February|March|April|May|June|July|August|September|October|November|December)?[\,\.\s]*(?:\d{1,2}[\-\/\.),]*)
I was trying to extract dates from text in the following formats:
January 1, 2020
January 01, 2020
JANUARY 1, 2020
JANUARY 01, 2020
Jan. 1, 2020
Jan. 01, 2020
JAN. 1, 2020
JAN. 01, 2020
2020 January 1
2020 January 01
2020 Jan. 1
2020 Jan. 01
2020 JAN. 1
2020 JAN. 01
01/01/2020
2020/01/01
01.01.2020
2020.01.01
01-01-2020
2020-01-01
Here's a sample. The problem is that it fails to extract dates in these formats: 2020 JAN. 1, 2020 JAN. 01, 2020 Jan. 01, 2020-01-01.
You can use
pattern = r"""(?ix)
\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?) [\s.]* (?:0?[1-9]|[12][0-9]|3[01]) [\s,.]* (?:19|20)(?:\d{2})? # Jan 01 2000
|
(?<!\d)(?:19|20)(?:\d{2})? [\s,.]* (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?) [\s.]* (?:0?[1-9]|[12][0-9]|3[01]) # 2000 Jan 01
|
(?<!\d)
(?:
(?:0?[1-9]|1[012])[-/.]?(?:0?[1-9]|[12][0-9]|3[01])[-/.]?(?:19|20)\d\d # MM/dd/yyyy
|
(?:19|20)\d\d[-/.]?(?:0?[1-9]|1[012])[-/.]?(?:0?[1-9]|[12][0-9]|3[01]) # yyyy/MM/dd
)
(?!\d)"""
See the regex demo
The i modifier flag enables case insensitive matching and x enables the VERBOSE mode.
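As a quick check, the pattern above can be compiled and run against some of the formats listed in the question. This is only a usage sketch: the sample strings come from the question, while the sentence at the end is made up.

```python
import re

# The pattern from the answer above, unchanged.
pattern = r"""(?ix)
\b(?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?) [\s.]* (?:0?[1-9]|[12][0-9]|3[01]) [\s,.]* (?:19|20)(?:\d{2})? # Jan 01 2000
|
(?<!\d)(?:19|20)(?:\d{2})? [\s,.]* (?:Jan(?:uary)?|Feb(?:ruary)?|Mar(?:ch)?|Apr(?:il)?|May|June?|July?|Aug(?:ust)?|Sep(?:tember)?|Oct(?:ober)?|(?:Nov|Dec)(?:ember)?) [\s.]* (?:0?[1-9]|[12][0-9]|3[01]) # 2000 Jan 01
|
(?<!\d)
(?:
(?:0?[1-9]|1[012])[-/.]?(?:0?[1-9]|[12][0-9]|3[01])[-/.]?(?:19|20)\d\d # MM/dd/yyyy
|
(?:19|20)\d\d[-/.]?(?:0?[1-9]|1[012])[-/.]?(?:0?[1-9]|[12][0-9]|3[01]) # yyyy/MM/dd
)
(?!\d)"""
date_re = re.compile(pattern)

# Formats taken from the question, including the ones that used to fail.
samples = [
    "January 1, 2020", "JAN. 01, 2020", "2020 Jan. 1", "2020 JAN. 01",
    "01/01/2020", "2020-01-01", "2020.01.01",
]
matches = [date_re.search(s).group(0) for s in samples]

# The pattern has no capturing groups, so findall returns the full matches.
text = "Invoice dated 2020 JAN. 01 was paid on 01/15/2020."
found = date_re.findall(text)
print(matches)
print(found)
```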
I have a dataframe which looks like this:
0 1 2
0 April 0.002745 ADANIPORTS.NS
1 July 0.005239 ASIANPAINT.NS
2 April 0.003347 AXISBANK.NS
3 April 0.004469 BAJAJ-AUTO.NS
4 June 0.006045 BAJFINANCE.NS
5 June 0.005176 BAJAJFINSV.NS
6 April 0.003321 BHARTIARTL.NS
7 November 0.003469 INFRATEL.NS
8 April 0.002667 BPCL.NS
9 April 0.003864 BRITANNIA.NS
10 April 0.005570 CIPLA.NS
11 October 0.000925 COALINDIA.NS
12 April 0.003666 DRREDDY.NS
13 April 0.002836 EICHERMOT.NS
14 April 0.003793 GAIL.NS
15 April 0.003850 GRASIM.NS
16 April 0.002858 HCLTECH.NS
17 December 0.005666 HDFC.NS
18 April 0.003484 HDFCBANK.NS
19 April 0.004173 HEROMOTOCO.NS
20 April 0.006395 HINDALCO.NS
21 June 0.001844 HINDUNILVR.NS
22 October 0.004620 ICICIBANK.NS
23 April 0.004020 INDUSINDBK.NS
24 January 0.002496 INFY.NS
25 September 0.001835 IOC.NS
26 May 0.002290 ITC.NS
27 April 0.005910 JSWSTEEL.NS
28 April 0.003570 KOTAKBANK.NS
29 May 0.003346 LT.NS
30 April 0.006131 M&M.NS
31 April 0.003912 MARUTI.NS
32 March 0.003596 NESTLEIND.NS
33 April 0.002180 NTPC.NS
34 April 0.003209 ONGC.NS
35 June 0.001796 POWERGRID.NS
36 April 0.004182 RELIANCE.NS
37 April 0.004246 SHREECEM.NS
38 October 0.004836 SBIN.NS
39 April 0.002596 SUNPHARMA.NS
40 April 0.004235 TCS.NS
41 April 0.006729 TATAMOTORS.NS
42 October 0.003395 TATASTEEL.NS
43 August 0.002440 TECHM.NS
44 June 0.003481 TITAN.NS
45 April 0.003749 ULTRACEMCO.NS
46 April 0.005854 UPL.NS
47 April 0.004991 VEDL.NS
48 July 0.001627 WIPRO.NS
49 April 0.003728 ZEEL.NS
How can I create a MultiIndex dataframe that is grouped by column 0? When I do:
new.groupby([0])
Out[315]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0A938BB0>
I am not able to group all the months together.
How to groupby and create a multiindex dataframe
Based on your info, I'd suggest the following:
# rename the columns to something meaningful
new = new.rename(columns={0:'Month',1:'Price', 2:'Ticker'})
new.groupby(['Month','Ticker'])['Price'].sum()
Note: you should convert 'Month' to a datetime, or else the months will be ordered alphabetically rather than chronologically.
Also, the documentation is quite strong for pandas.
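As a sketch of what that could look like end to end, the example below groups by month and ticker and gets the months in calendar order. Since the data contains only month names and no year, I swapped the datetime conversion for an ordered categorical, which is my own substitution. The five rows are taken from the question.

```python
import pandas as pd

# A few rows from the question's frame, with its original integer column labels.
new = pd.DataFrame({
    0: ["April", "July", "April", "June", "January"],
    1: [0.002745, 0.005239, 0.003347, 0.006045, 0.002496],
    2: ["ADANIPORTS.NS", "ASIANPAINT.NS", "AXISBANK.NS", "BAJFINANCE.NS", "INFY.NS"],
})
new = new.rename(columns={0: "Month", 1: "Price", 2: "Ticker"})

months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
new["Month"] = pd.Categorical(new["Month"], categories=months, ordered=True)

# Grouping by two keys yields a Series with a (Month, Ticker) MultiIndex.
# observed=True keeps only the month/ticker pairs that occur in the data.
grouped = new.groupby(["Month", "Ticker"], observed=True)["Price"].sum()
print(grouped)
```

The groupby object printed in the question is just a lazy grouping. The MultiIndex only appears once an aggregation such as sum() is applied.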
I am trying to sort the 'Month' column within my 'Mon18' table so that it runs from January through December, each with its corresponding count. When I try to sort the column, it is ordered either by highest count or alphabetically by month name. See an example below:
print (df)
Months Count
10 April 2018 684
3 August 2018 1098
1 December 2018 1207
11 February 2018 642
8 January 2018 847
5 July 2018 1040
6 June 2018 986
9 March 2018 809
7 May 2018 854
0 November 2018 1215
2 October 2018 1128
4 September 2018 1062
The idea is to convert the column to datetimes and use Series.argsort to get the positions to pass to DataFrame.iloc:
df = df.iloc[pd.to_datetime(df['Months'], format='%B %Y').argsort()]
print (df)
Months Count
8 January 2018 847
11 February 2018 642
9 March 2018 809
10 April 2018 684
7 May 2018 854
6 June 2018 986
5 July 2018 1040
3 August 2018 1098
4 September 2018 1062
2 October 2018 1128
0 November 2018 1215
1 December 2018 1207
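An alternative sketch, assuming pandas 1.1 or newer, is to pass the datetime conversion as the key argument of DataFrame.sort_values. The five-row frame below is a shortened copy of the question's data.

```python
import pandas as pd

# Shortened copy of the question's data, keeping the original index labels.
df = pd.DataFrame(
    {
        "Months": ["April 2018", "August 2018", "December 2018",
                   "February 2018", "January 2018"],
        "Count": [684, 1098, 1207, 642, 847],
    },
    index=[10, 3, 1, 11, 8],
)

# key receives the whole column and must return a same-length sortable Series.
result = df.sort_values(
    "Months", key=lambda s: pd.to_datetime(s, format="%B %Y")
)
print(result)
```

The Months column itself stays as text. Only the sort order is computed from the parsed dates.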