I have a dataframe which looks like this:
0 1 2
0 April 0.002745 ADANIPORTS.NS
1 July 0.005239 ASIANPAINT.NS
2 April 0.003347 AXISBANK.NS
3 April 0.004469 BAJAJ-AUTO.NS
4 June 0.006045 BAJFINANCE.NS
5 June 0.005176 BAJAJFINSV.NS
6 April 0.003321 BHARTIARTL.NS
7 November 0.003469 INFRATEL.NS
8 April 0.002667 BPCL.NS
9 April 0.003864 BRITANNIA.NS
10 April 0.005570 CIPLA.NS
11 October 0.000925 COALINDIA.NS
12 April 0.003666 DRREDDY.NS
13 April 0.002836 EICHERMOT.NS
14 April 0.003793 GAIL.NS
15 April 0.003850 GRASIM.NS
16 April 0.002858 HCLTECH.NS
17 December 0.005666 HDFC.NS
18 April 0.003484 HDFCBANK.NS
19 April 0.004173 HEROMOTOCO.NS
20 April 0.006395 HINDALCO.NS
21 June 0.001844 HINDUNILVR.NS
22 October 0.004620 ICICIBANK.NS
23 April 0.004020 INDUSINDBK.NS
24 January 0.002496 INFY.NS
25 September 0.001835 IOC.NS
26 May 0.002290 ITC.NS
27 April 0.005910 JSWSTEEL.NS
28 April 0.003570 KOTAKBANK.NS
29 May 0.003346 LT.NS
30 April 0.006131 M&M.NS
31 April 0.003912 MARUTI.NS
32 March 0.003596 NESTLEIND.NS
33 April 0.002180 NTPC.NS
34 April 0.003209 ONGC.NS
35 June 0.001796 POWERGRID.NS
36 April 0.004182 RELIANCE.NS
37 April 0.004246 SHREECEM.NS
38 October 0.004836 SBIN.NS
39 April 0.002596 SUNPHARMA.NS
40 April 0.004235 TCS.NS
41 April 0.006729 TATAMOTORS.NS
42 October 0.003395 TATASTEEL.NS
43 August 0.002440 TECHM.NS
44 June 0.003481 TITAN.NS
45 April 0.003749 ULTRACEMCO.NS
46 April 0.005854 UPL.NS
47 April 0.004991 VEDL.NS
48 July 0.001627 WIPRO.NS
49 April 0.003728 ZEEL.NS
How can I create a MultiIndex dataframe that groups by column 0? When I do:
new.groupby([0])
Out[315]: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x0A938BB0>
I am not able to group all the months together.
How do I group by and create a MultiIndex dataframe?
Based on your info, I'd suggest the following:
# rename the columns to something useful
new = new.rename(columns={0: 'Month', 1: 'Price', 2: 'Ticker'})
new.groupby(['Month', 'Ticker'])['Price'].sum()
Note that you should change 'Month' to a datetime or an ordered categorical, or else the months will sort in an illogical (alphabetical) order.
Also, the pandas documentation on groupby is quite thorough.
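A minimal sketch of the month-ordering fix on toy data, assuming the renamed Month/Price/Ticker columns from the answer above — converting Month to an ordered categorical keeps the grouped result in calendar order:

```python
import pandas as pd

# Toy data standing in for the renamed dataframe (assumed column names)
new = pd.DataFrame({
    'Month': ['April', 'July', 'April', 'November'],
    'Price': [0.002745, 0.005239, 0.003347, 0.003469],
    'Ticker': ['ADANIPORTS.NS', 'ASIANPAINT.NS', 'AXISBANK.NS', 'INFRATEL.NS'],
})

month_order = ['January', 'February', 'March', 'April', 'May', 'June',
               'July', 'August', 'September', 'October', 'November', 'December']
# ordered=True makes the categorical sort chronologically, not alphabetically
new['Month'] = pd.Categorical(new['Month'], categories=month_order, ordered=True)

# observed=True drops month/ticker combinations that never occur
result = new.groupby(['Month', 'Ticker'], observed=True)['Price'].sum()
print(result)  # MultiIndex (Month, Ticker), months in calendar order
```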
When I try to parse a wiki page for its tables, the tables are read correctly except for the date of birth column, which comes back as empty. Is there a workaround for this? I've tried using beautiful soup but I get the same result.
The code I've used is as follows:
url = 'https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads'
pd.read_html(url)
Here's an image of one of the tables in question:
One possible solution is to alter the page content with BeautifulSoup and then load it into pandas:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
# select correct table, here I select the first one:
tbl = soup.select("table")[0]
# remove the (aged XX) part:
for td in tbl.select("td:nth-of-type(3)"):
    td.string = td.contents[-1].split("(")[0]
df = pd.read_html(str(tbl))[0]
print(df)
Prints:
No. Pos. Player Date of birth (age) Caps Club
0 1 GK Thomas Sørensen 12 June 1976 14 Sunderland
1 2 MF Stig Tøfting 14 August 1969 36 Bolton Wanderers
2 3 DF René Henriksen 27 August 1969 39 Panathinaikos
3 4 DF Martin Laursen 26 July 1977 15 Milan
4 5 DF Jan Heintze (c) 17 August 1963 83 PSV Eindhoven
5 6 DF Thomas Helveg 24 June 1971 67 Milan
6 7 MF Thomas Gravesen 11 March 1976 22 Everton
7 8 MF Jesper Grønkjær 12 August 1977 25 Chelsea
8 9 FW Jon Dahl Tomasson 29 August 1976 38 Feyenoord
9 10 MF Martin Jørgensen 6 October 1975 32 Udinese
10 11 FW Ebbe Sand 19 July 1972 44 Schalke 04
11 12 DF Niclas Jensen 17 August 1974 8 Manchester City
12 13 DF Steven Lustü 13 April 1971 4 Lyn
13 14 MF Claus Jensen 29 April 1977 13 Charlton Athletic
14 15 MF Jan Michaelsen 28 November 1970 11 Panathinaikos
15 16 GK Peter Kjær 5 November 1965 4 Aberdeen
16 17 MF Christian Poulsen 28 February 1980 3 Copenhagen
17 18 FW Peter Løvenkrands 29 January 1980 4 Rangers
18 19 MF Dennis Rommedahl 22 July 1978 19 PSV Eindhoven
19 20 DF Kasper Bøgelund 8 October 1980 2 PSV Eindhoven
20 21 FW Peter Madsen 26 April 1978 4 Brøndby
21 22 GK Jesper Christiansen 24 April 1978 0 Vejle
22 23 MF Brian Steen Nielsen 28 December 1968 65 Malmö FF
Alternatively, try setting the parse_dates parameter in the read_html call.
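Once the "(aged XX)" suffix has been stripped as above, the remaining strings parse cleanly as dates. A small sketch on sample values taken from the printed table (the trailing space mimics the residue left by the split, and stripping it is an assumption about the cleaned output):

```python
import pandas as pd

# Sample cleaned values from the "Date of birth (age)" column above
dob = pd.Series(["12 June 1976 ", "14 August 1969 ", "27 August 1969 "])

# Strip the leftover whitespace, then parse with an explicit format
parsed = pd.to_datetime(dob.str.strip(), format="%d %B %Y")
print(parsed.dt.year.tolist())  # [1976, 1969, 1969]
```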
I'm trying to create a table like in this example:
Example_picture
My code:
data = list(range(39)) # mockup for 39 values
columns = pd.MultiIndex.from_product([['1', '2', '6'], [str(year) for year in range(2007, 2020)]],
                                     names=['Factor', 'Year'])
df = pd.DataFrame(data, index=['World'], columns=columns)
print(df)
But I get this error:
Shape of passed values is (39, 1), indices imply (1, 39)
What did I do wrong?
You need to wrap the data in a list to force the DataFrame constructor to interpret the list as a row:
data = list(range(39))
columns = pd.MultiIndex.from_product([['1', '2', '6'],
                                      [str(year) for year in range(2007, 2020)]],
                                     names=['Factor', 'Year'])
df = pd.DataFrame([data], index=['World'], columns=columns)
output:
Factor 1 2 6
Year 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019
World 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38
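Once the frame is built this way, the MultiIndex columns can be sliced by either level; a short sketch using the same constructor as above:

```python
import pandas as pd

data = list(range(39))
columns = pd.MultiIndex.from_product(
    [['1', '2', '6'], [str(year) for year in range(2007, 2020)]],
    names=['Factor', 'Year'])
df = pd.DataFrame([data], index=['World'], columns=columns)

# Select all years for one factor (outer level):
print(df['1'])

# Cross-section on the inner 'Year' level across all factors:
print(df.xs('2007', axis=1, level='Year'))
```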
I have a table:
Year  Week  Sales
2021    47     56
2021    48      5
2021    49      4
2021    50      6
2021    51      7
2021    52     10
2022     1      2
2021     2      3
I want to get all data from week 49 of 2021 onwards. However, if I make the following slice:
table[(table.Year >= 2021) & (table.Week >= 49)]
I get the data for every week >= 49 in every year from 2021 onwards. How can I include weeks 1-48 of later years in the slice without creating a new column? That is, how do I get all data from the table starting at year 2021, week 49 (2021: weeks 49-52, 2022: weeks 1-52, 2023: weeks 1-52, etc.)?
You're missing an OR. IIUC, you want data from week 49 of 2021 onwards. As a logical expression: either the year is 2021 and the week is at least 49, or the year is greater than 2021:
out = table[((table.Year == 2021) & (table.Week >= 49)) | (table.Year > 2021)]
Output:
Year Week Sales
2 2021 49 4
3 2021 50 6
4 2021 51 7
5 2021 52 10
6 2022 1 2
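An equivalent trick, sketched here on a mock-up of the table, is to collapse year and week into a single number so one comparison suffices (Year*100 + Week is monotonic in the (Year, Week) ordering because Week never exceeds 53):

```python
import pandas as pd

table = pd.DataFrame({
    'Year':  [2021, 2021, 2021, 2021, 2021, 2021, 2022, 2021],
    'Week':  [47, 48, 49, 50, 51, 52, 1, 2],
    'Sales': [56, 5, 4, 6, 7, 10, 2, 3],
})

# Year*100 + Week preserves the chronological order, so one inequality works
out = table[table['Year'] * 100 + table['Week'] >= 2021 * 100 + 49]
print(out)
```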
I have a dataframe which looks like the following:
df:
RY Week no Value
2020 14 3.95321
2020 15 3.56425
2020 16 0.07042
2020 17 6.45417
2020 18 0.00029
2020 19 0.27737
2020 20 4.12644
2020 21 0.32753
2020 22 0.47239
2020 23 0.28756
2020 24 1.83029
2020 25 0.75385
2020 26 2.08981
2020 27 2.05611
2020 28 1.00614
2020 29 0.02105
2020 30 0.58101
2020 31 3.49083
2020 32 8.29013
2020 33 8.99825
2020 34 2.66293
2020 35 0.16448
2020 36 2.26301
2020 37 1.09302
2020 38 1.66566
2020 39 1.47233
2020 40 6.42708
2020 41 2.67947
2020 42 6.79551
2020 43 4.45881
2020 44 1.87972
2020 45 0.76284
2020 46 1.8671
2020 47 2.07159
2020 48 2.87303
2020 49 7.66944
2020 50 1.20421
2020 51 9.04416
2020 52 2.2625
2020 1 1.17026
2020 2 14.22263
2020 3 1.36464
2020 4 2.64862
2020 5 8.69916
2020 6 4.51259
2020 7 2.83411
2020 8 3.64183
2020 9 4.77292
2020 10 1.64729
2020 11 1.6878
2020 12 2.24874
2020 13 0.32712
I created a Week no column from the date. In my scenario the regulatory year runs from 1st April to 31st March of the next year, which is why Week no starts at 14 and ends at 13. Now I want to create another column containing the cumulative sum of the Value column. I tried cumsum() with the following code:
df['Cummulative Value'] = df.groupby('RY')['Value'].apply(lambda x: x.cumsum())
The problem with the above code is that it starts the cumulative sum from week no 1, not from week no 14 onwards. Is there any way to calculate the cumulative sum without disturbing the week order?
EDIT: You can sort the values by RY and Week no before GroupBy.cumsum, then sort the index at the end to restore the original row order:
#create default index for correct working
df = df.reset_index(drop=True)
df['Cummulative Value'] = df.sort_values(['RY','Week no']).groupby('RY')['Value'].cumsum().sort_index()
print (df)
RY Week no Value Cummulative Value
0 2020 14 3.95321 53.73092
1 2020 15 3.56425 57.29517
2 2020 16 0.07042 57.36559
3 2020 17 6.45417 63.81976
4 2020 18 0.00029 63.82005
5 2020 19 0.27737 64.09742
6 2020 20 4.12644 68.22386
7 2020 21 0.32753 68.55139
8 2020 22 0.47239 69.02378
9 2020 23 0.28756 69.31134
10 2020 24 1.83029 71.14163
11 2020 25 0.75385 71.89548
12 2020 26 2.08981 73.98529
13 2020 27 2.05611 76.04140
14 2020 28 1.00614 77.04754
15 2020 29 0.02105 77.06859
16 2020 30 0.58101 77.64960
17 2020 31 3.49083 81.14043
18 2020 32 8.29013 89.43056
19 2020 33 8.99825 98.42881
20 2020 34 2.66293 101.09174
21 2020 35 0.16448 101.25622
22 2020 36 2.26301 103.51923
23 2020 37 1.09302 104.61225
24 2020 38 1.66566 106.27791
25 2020 39 1.47233 107.75024
26 2020 40 6.42708 114.17732
27 2020 41 2.67947 116.85679
28 2020 42 6.79551 123.65230
29 2020 43 4.45881 128.11111
30 2020 44 1.87972 129.99083
31 2020 45 0.76284 130.75367
32 2020 46 1.86710 132.62077
33 2020 47 2.07159 134.69236
34 2020 48 2.87303 137.56539
35 2020 49 7.66944 145.23483
36 2020 50 1.20421 146.43904
37 2020 51 9.04416 155.48320
38 2020 52 2.26250 157.74570
39 2020 1 1.17026 1.17026
40 2020 2 14.22263 15.39289
41 2020 3 1.36464 16.75753
42 2020 4 2.64862 19.40615
43 2020 5 8.69916 28.10531
44 2020 6 4.51259 32.61790
45 2020 7 2.83411 35.45201
46 2020 8 3.64183 39.09384
47 2020 9 4.77292 43.86676
48 2020 10 1.64729 45.51405
49 2020 11 1.68780 47.20185
50 2020 12 2.24874 49.45059
51 2020 13 0.32712 49.77771
EDIT:
After some discussion, the solution can be simplified to a plain GroupBy.cumsum:
df['Cummulative Value'] = df.groupby('RY')['Value'].cumsum()
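A minimal sketch on mock data showing why the simplification works: GroupBy.cumsum accumulates top-to-bottom in the existing row order within each group, so the week-14-first ordering is preserved without any sorting:

```python
import pandas as pd

df = pd.DataFrame({
    'RY':      [2020, 2020, 2020, 2020],
    'Week no': [14, 15, 1, 2],          # regulatory year: week 14 comes first
    'Value':   [1.0, 2.0, 10.0, 20.0],
})

# cumsum runs in row order within each RY group; no re-sorting happens
df['Cumulative Value'] = df.groupby('RY')['Value'].cumsum()
print(df)
```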
I have a pandas dataframe like this:
column_year column_Month a_integer_column
0 2014 April 25.326531
1 2014 August 25.544554
2 2015 December 25.678261
3 2014 February 24.801187
4 2014 July 24.990338
... ... ... ...
68 2018 November 26.024931
69 2017 October 25.677333
70 2019 September 24.432361
71 2020 February 25.383648
72 2020 January 25.504831
I now want to sort year column first and then month column, like this below:
column_year column_Month a_integer_column
3 2014 February 24.801187
0 2014 April 25.326531
4 2014 July 24.990338
1 2014 August 25.544554
2 2015 December 25.678261
... ... ... ...
69 2017 October 25.677333
68 2018 November 26.024931
70 2019 September 24.432361
72 2020 January 25.504831
71 2020 February 25.383648
How do I do this?
Let us try to_datetime + argsort:
df = df.iloc[pd.to_datetime(df.column_year.astype(str) + df.column_Month, format='%Y%B').argsort()]
column_year column_Month a_integer_column
3 2014 February 24.801187
0 2014 April 25.326531
4 2014 July 24.990338
1 2014 August 25.544554
2 2015 December 25.678261
You can change the column_Month column into a CategoricalDtype
Months = pd.CategoricalDtype([
'January', 'February', 'March', 'April', 'May', 'June',
'July', 'August', 'September', 'October', 'November', 'December'
], ordered=True)
df.astype({'column_Month': Months}).sort_values(['column_year', 'column_Month'])
column_year column_Month a_integer_column
3 2014 February 24.801187
0 2014 April 25.326531
4 2014 July 24.990338
1 2014 August 25.544554
2 2015 December 25.678261
69 2017 October 25.677333
68 2018 November 26.024931
70 2019 September 24.432361
72 2020 January 25.504831
71 2020 February 25.383648
With the categorical dtype applied, a plain sort then does the right thing:
df = df.sort_values(by=["column_year", "column_Month"], ascending=[True, True])
Without that conversion, this line alone would sort the month names alphabetically.
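Another option, sketched here on a sample of the data and assuming pandas 1.1+ (for the key parameter of sort_values), is to supply a sort key that maps month names to month numbers, avoiding any dtype change:

```python
import pandas as pd

df = pd.DataFrame({
    'column_year':  [2014, 2014, 2015, 2014, 2014],
    'column_Month': ['April', 'August', 'December', 'February', 'July'],
    'a_integer_column': [25.326531, 25.544554, 25.678261, 24.801187, 24.990338],
})

# The key is applied to each sort column in turn: convert month names
# to their 1-12 number, and pass the year column through unchanged.
out = df.sort_values(
    by=['column_year', 'column_Month'],
    key=lambda s: pd.to_datetime(s, format='%B').dt.month
        if s.name == 'column_Month' else s,
)
print(out)
```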