Create a dataframe including group by sums and total sum - python

I have the following data frame:
Race Course Horse Year Month Day Amount Won/Lost
0 Aintree Red Rum 2017 5 12 11.58 won
1 Punchestown Camelot 2016 12 22 122.52 won
2 Sandown Beef of Salmon 2016 11 17 20.00 lost
3 Ayr Corbiere 2016 11 3 25.00 lost
4 Fairyhouse Red Rum 2016 12 2 65.75 won
5 Ayr Camelot 2017 3 11 12.05 won
6 Aintree Hurricane Fly 2017 5 12 11.58 won
7 Punchestown Beef or Salmon 2016 12 22 112.52 won
8 Sandown Aldaniti 2016 11 17 10.00 lost
9 Ayr Henry the Navigator 2016 11 1 15.00 lost
10 Fairyhouse Jumanji 2016 10 2 65.75 won
11 Ayr Came Second 2017 3 11 12.05 won
12 Aintree Murder 2017 5 12 5.00 lost
13 Punchestown King Arthur 2016 6 22 52.52 won
14 Sandown Filet of Fish 2016 11 17 20.00 lost
15 Ayr Denial 2016 11 3 25.00 lost
16 Fairyhouse Don't Gamble 2016 12 12 165.75 won
17 Ayr Ireland 2017 1 11 22.05 won
I am trying to create another data frame which includes only the total number of races (rows) and the total number of races won. It would ideally look like the following:
total races 18
total won 11
However, all I have been able to do is group by counts, counting total won and total lost. This is what I have attempted:
df = df.groupby(['Won/Lost']).size().add_prefix('total ')
And this is what it returns:
Won/Lost
total lost 7
total won 11
dtype: int64
I am at a dead end and cannot figure out a simple solution.

Assuming content of races.csv is:
Race Course,Horse,Year,Month,Day,Amount,Won/Lost
Aintree,Red Rum,2017,5,12,11.58,won
Punchestown,Camelot,2016,12,22,122.52,won
Sandown,Beef of Salmon,2016,11,17,20.00,lost
Ayr,Corbiere,2016,11,3,25.00,lost
Fairyhouse,Red Rum,2016,12,2,65.75,won
Ayr,Camelot,2017,3,11,12.05,won
Aintree,Hurricane Fly,2017,5,12,11.58,won
Punchestown,Beef or Salmon,2016,12,22,112.52,won
Sandown,Aldaniti,2016,11,17,10.00,lost
Ayr,Henry the Navigator,2016,11,1,15.00,lost
Fairyhouse,Jumanji,2016,10,2,65.75,won
Ayr,Came Second,2017,3,11,12.05,won
Aintree,Murder,2017,5,12,5.00,lost
Punchestown,King Arthur,2016,6,22,52.52,won
Sandown,Filet of Fish,2016,11,17,20.00,lost
Ayr,Denial,2016,11,3,25.00,lost
Fairyhouse,Don't Gamble,2016,12,12,165.75,won
Ayr,Ireland,2017,1,11,22.05,won
Steps to get the new dataframe:
>>> races_df = pd.read_csv('races.csv')
>>> races_df
Race Course Horse Year Month Day Amount Won/Lost
0 Aintree Red Rum 2017 5 12 11.58 won
1 Punchestown Camelot 2016 12 22 122.52 won
2 Sandown Beef of Salmon 2016 11 17 20.00 lost
3 Ayr Corbiere 2016 11 3 25.00 lost
4 Fairyhouse Red Rum 2016 12 2 65.75 won
5 Ayr Camelot 2017 3 11 12.05 won
6 Aintree Hurricane Fly 2017 5 12 11.58 won
7 Punchestown Beef or Salmon 2016 12 22 112.52 won
8 Sandown Aldaniti 2016 11 17 10.00 lost
9 Ayr Henry the Navigator 2016 11 1 15.00 lost
10 Fairyhouse Jumanji 2016 10 2 65.75 won
11 Ayr Came Second 2017 3 11 12.05 won
12 Aintree Murder 2017 5 12 5.00 lost
13 Punchestown King Arthur 2016 6 22 52.52 won
14 Sandown Filet of Fish 2016 11 17 20.00 lost
15 Ayr Denial 2016 11 3 25.00 lost
16 Fairyhouse Don't Gamble 2016 12 12 165.75 won
17 Ayr Ireland 2017 1 11 22.05 won
>>>
>>> total_races = len(races_df)
>>>
>>> total_win = races_df[races_df['Won/Lost'] == 'won']['Won/Lost'].count()
>>>
>>> new_df = pd.DataFrame({'total_races': total_races, 'total_win': total_win}, index=pd.RangeIndex(1))
>>>
>>> new_df
total_races total_win
0 18 11
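As a shortcut, the two totals can also be collected into a small Series straight from the loaded frame, which matches the "total races / total won" layout asked for in the question. A minimal sketch, assuming races_df has been loaded as above:
>>> summary = pd.Series({'total races': len(races_df),
...                      'total won': (races_df['Won/Lost'] == 'won').sum()})
>>> summary
total races    18
total won      11
dtype: int64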

Related

pd.read_html() not reading date

When I try to parse a wiki page for its tables, the tables are read correctly except for the date of birth column, which comes back as empty. Is there a workaround for this? I've tried using beautiful soup but I get the same result.
The code I've used is as follows:
url = 'https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads'
pd.read_html(url)
Here's an image of one of the tables in question:
One possible solution is to alter the page content with BeautifulSoup and then load it into pandas:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://en.wikipedia.org/wiki/2002_FIFA_World_Cup_squads"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

# select correct table, here I select the first one:
tbl = soup.select("table")[0]

# remove the (aged XX) part:
for td in tbl.select("td:nth-of-type(3)"):
    td.string = td.contents[-1].split("(")[0]

df = pd.read_html(str(tbl))[0]
print(df)
Prints:
No. Pos. Player Date of birth (age) Caps Club
0 1 GK Thomas Sørensen 12 June 1976 14 Sunderland
1 2 MF Stig Tøfting 14 August 1969 36 Bolton Wanderers
2 3 DF René Henriksen 27 August 1969 39 Panathinaikos
3 4 DF Martin Laursen 26 July 1977 15 Milan
4 5 DF Jan Heintze (c) 17 August 1963 83 PSV Eindhoven
5 6 DF Thomas Helveg 24 June 1971 67 Milan
6 7 MF Thomas Gravesen 11 March 1976 22 Everton
7 8 MF Jesper Grønkjær 12 August 1977 25 Chelsea
8 9 FW Jon Dahl Tomasson 29 August 1976 38 Feyenoord
9 10 MF Martin Jørgensen 6 October 1975 32 Udinese
10 11 FW Ebbe Sand 19 July 1972 44 Schalke 04
11 12 DF Niclas Jensen 17 August 1974 8 Manchester City
12 13 DF Steven Lustü 13 April 1971 4 Lyn
13 14 MF Claus Jensen 29 April 1977 13 Charlton Athletic
14 15 MF Jan Michaelsen 28 November 1970 11 Panathinaikos
15 16 GK Peter Kjær 5 November 1965 4 Aberdeen
16 17 MF Christian Poulsen 28 February 1980 3 Copenhagen
17 18 FW Peter Løvenkrands 29 January 1980 4 Rangers
18 19 MF Dennis Rommedahl 22 July 1978 19 PSV Eindhoven
19 20 DF Kasper Bøgelund 8 October 1980 2 PSV Eindhoven
20 21 FW Peter Madsen 26 April 1978 4 Brøndby
21 22 GK Jesper Christiansen 24 April 1978 0 Vejle
22 23 MF Brian Steen Nielsen 28 December 1968 65 Malmö FF
Try setting the parse_dates parameter to True inside the read_html method.
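If the remaining strings should also become real datetimes, they can be parsed after the cleanup above; a hedged sketch, assuming the column keeps the header "Date of birth (age)" shown in the printed output:
# parse the cleaned birth-date strings; anything unparseable becomes NaT
df["Date of birth (age)"] = pd.to_datetime(df["Date of birth (age)"].str.strip(), errors="coerce")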

Pandas Python - How to create new columns with MultiIndex from pivot table

I have created a pivot table with 2 different types of values i) Number of apples from 2017-2020, ii) Number of people from 2017-2020. I want to create additional columns to calculate iii) Apples per person from 2017-2020. How can I do so?
Current code for pivot table:
tdf = df.pivot_table(index="States",
                     columns="Year",
                     values=["Number of Apples", "Number of People"],
                     aggfunc=lambda x: len(x.unique()),
                     margins=True)
tdf
Here is my current pivot table:
Number of Apples Number of People
2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5
West Virginia 8 35 25 12 2 5 5 4
...
I want my pivot table to look like this, where I add additional columns to divide Number of Apples by Number of People.
Number of Apples Number of People Number of Apples per Person
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5 6 5 5
West Virginia 8 35 25 12 2 5 5 4 4 7 5 3
I've tried a few things, such as:
Creating a new column by assigning a new column name, but this does not work with a MultiIndex column: tdf["Number of Apples per Person"][2017] = tdf["Number of Apples"][2017] / tdf["Number of People"][2017]
Trying the assign method instead: tdf.assign(tdf["Number of Apples per Person"][2017] = tdf["Enrollment ID"][2017] / tdf["Student ID"][2017]), which raises SyntaxError: expression cannot contain assignment, perhaps you meant "=="?
Appreciate any help! Thanks
What you can do here is stack(), do your thing, and then unstack():
s = df.stack()
s['Number of Apples per Person'] = s['Number of Apples'] / s['Number of People']
df = s.unstack()
Output:
>>> df
Number of Apples Number of People Number of Apples per Person
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5.0 6.0 5.0 5.0
West Virginia 8 35 25 12 2 5 5 4 4.0 7.0 5.0 3.0
One-liner:
df = df.stack().pipe(lambda x: x.assign(**{'Number of Apples per Person': x['Number of Apples'] / x['Number of People']})).unstack()
Given
df
Number of Apples Number of People
2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5
West Virginia 8 35 25 12 2 5 5 4
You can index on the first level to get sub-frames and then divide. The division will be auto-aligned on the columns.
df['Number of Apples'] / df['Number of People']
2017 2018 2019 2020
California 5.0 6.0 5.0 5.0
West Virginia 4.0 7.0 5.0 3.0
Append this back to your DataFrame:
pd.concat([df, pd.concat([df['Number of Apples'] / df['Number of People']], keys=['Result'], axis=1)], axis=1)
Number of Apples Number of People Result
2017 2018 2019 2020 2017 2018 2019 2020 2017 2018 2019 2020
California 10 18 20 25 2 3 4 5 5.0 6.0 5.0 5.0
West Virginia 8 35 25 12 2 5 5 4 4.0 7.0 5.0 3.0
This is fast since it is completely vectorized.
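If the appended block should carry the label from the desired output rather than 'Result', only the keys argument needs to change; a small sketch, assuming df still holds the pivot table above:
ratio = df['Number of Apples'] / df['Number of People']
result = pd.concat([df, pd.concat([ratio], keys=['Number of Apples per Person'], axis=1)], axis=1)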

How to apply values in column based on condition in row values in python

I have a dataframe that looks like this:
df
Name year week date
0 Adam 2016 16 2016-04-24
1 Mary 2016 17 2016-05-01
2 Jane 2016 20 2016-05-22
3 Joe 2016 17 2016-05-01
4 Arthur 2017 44 2017-11-05
5 Liz 2017 41 2017-10-15
6 Janice 2016 47 2016-11-27
And I want to create a column season, so df['season'], that assigns a season of MAM or OND depending on the value in week.
The result should look like this:
df_final
Name year week date season
0 Adam 2016 16 2016-04-24 MAM
1 Mary 2016 17 2016-05-01 MAM
2 Jane 2016 20 2016-05-22 MAM
3 Joe 2016 17 2016-05-01 MAM
4 Arthur 2017 44 2017-11-05 OND
5 Liz 2017 41 2017-10-15 OND
6 Janice 2016 47 2016-11-27 OND
In essence, values of week that are below 40 should be paired with MAM and values above 40 should be OND.
So far I have this:
condition =df.week < 40
df['season'] = df[condition][[i for i in df.columns.values if i not in ['a']]].apply(lambda x: 'OND')
But it is clunky and does not produce the final response.
Thank you.
Use numpy.where:
import numpy as np

condition = df.week < 40
df['season'] = np.where(condition, 'MAM', 'OND')
print(df)
Name year week date season
0 Adam 2016 16 2016-04-24 MAM
1 Mary 2016 17 2016-05-01 MAM
2 Jane 2016 20 2016-05-22 MAM
3 Joe 2016 17 2016-05-01 MAM
4 Arthur 2017 44 2017-11-05 OND
5 Liz 2017 41 2017-10-15 OND
6 Janice 2016 47 2016-11-27 OND
EDIT:
To convert strings to integers, use astype:
condition = df.week.astype(int) < 40
Or convert column:
df.week = df.week.astype(int)
condition = df.week < 40
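The same assignment can also be written without numpy, using .loc with the boolean mask; a minimal sketch, assuming df as in the question:
df['season'] = 'OND'                    # default label
df.loc[df.week < 40, 'season'] = 'MAM'  # overwrite rows where week is below 40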

How to use group by on multiple columns?

I am using pandas for some data processing. My pandas statement looks like this:
yearage.groupby(['year', 'Tm']).size()
It gives me data like this:
2014 ATL 9
BOS 9
BRK 7
CHI 10
CHO 9
CLE 8
DAL 9
DEN 8
DET 9
GSW 8
When I convert it into a dataframe, I get only two columns: the compound key and the count. What I actually want is three columns:
year, Tm, Size
How do I separate the compound key back into its columns after the groupby?
You specify as_index=False in your groupby statement. As a side note, you probably want to use count (which excludes NaNs) instead of size.
>>> df.groupby(['year', 'Tm'], as_index=False).count()
year Tm a
0 2014 ATL 4
1 2014 BOS 4
2 2014 BRK 1
3 2014 CHI 1
4 2014 CHO 1
5 2014 CLE 1
6 2014 DAL 1
7 2014 DEN 1
8 2014 DET 1
9 2014 GSW 1
For size:
Another simple aggregation example is to compute the size of each group. This is included in GroupBy as the size method. It returns a Series whose index are the group names and whose values are the sizes of each group.
For count:
Compute count of group, excluding missing values
I think you can try reset_index with the parameter name for the new column name Size:
yearage.groupby(['year','Tm']).size().reset_index(name='Size')
Sample:
print(yearage)
year Tm a
0 2014 ATL 9
1 2014 ATL 9
2 2014 ATL 9
3 2014 ATL 9
4 2014 BOS 9
5 2014 BRK 7
6 2014 BOS 9
7 2014 BOS 9
8 2014 BOS 9
9 2014 CHI 10
10 2014 CHO 9
11 2014 CLE 8
12 2014 DAL 9
13 2014 DEN 8
14 2014 DET 9
15 2014 GSW 8
print(yearage.groupby(['year','Tm']).size().reset_index(name='Size'))
year Tm Size
0 2014 ATL 4
1 2014 BOS 4
2 2014 BRK 1
3 2014 CHI 1
4 2014 CHO 1
5 2014 CLE 1
6 2014 DAL 1
7 2014 DEN 1
8 2014 DET 1
9 2014 GSW 1
Without the name parameter you get a new column named 0:
print(yearage.groupby(['year','Tm']).size().reset_index())
year Tm 0
0 2014 ATL 4
1 2014 BOS 4
2 2014 BRK 1
3 2014 CHI 1
4 2014 CHO 1
5 2014 CLE 1
6 2014 DAL 1
7 2014 DEN 1
8 2014 DET 1
9 2014 GSW 1
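In newer pandas versions (0.25+), named aggregation gives the same flat year/Tm/Size layout in one call; a sketch, assuming yearage with the extra column a as in the sample above:
out = yearage.groupby(['year', 'Tm'], as_index=False).agg(Size=('a', 'size'))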

Adding columns of different length into pandas dataframe

I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul 57.00 2012
Susan 67.00 2012
Gary 54.00 2011
Paul 77.00 2011
Andrea 20.00 2011
Albert 23.00 2011
Hal 26.00 2010
Paul 23.00 2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00 54.00 26.00
67.00 77.00 23.00
20.00
23.00
So you see this results in columns of different lengths. When I try to do this using pandas, I get the error 'ValueError: Length of values does not match length of index'. I assume this is because I can't add columns of varying lengths to a dataframe.
Can anyone offer some advice on how to proceed? Perhaps I'm approaching this incorrectly? Thanks for any help!
I'd do this in a two-step process: first add a column corresponding to the index in each year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep=r"\s+")
>>> df
Name Money Year
0 Paul 57 2012
1 Susan 67 2012
2 Gary 54 2011
3 Paul 77 2011
4 Andrea 20 2011
5 Albert 23 2011
6 Hal 26 2010
7 Paul 23 2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
Name Money Year yindex
0 Paul 57 2012 0
1 Susan 67 2012 1
2 Gary 54 2011 0
3 Paul 77 2011 1
4 Andrea 20 2011 2
5 Albert 23 2011 3
6 Hal 26 2010 0
7 Paul 23 2010 1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 NaN 20 NaN
3 NaN 23 NaN
After which you could get rid of the NaNs if you like, but it depends on whether you want to distinguish between cases like "knowing the value is 0" and "not knowing what the value is":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year 2010 2011 2012
yindex
0 26 54 57
1 23 77 67
2 0 20 0
3 0 23 0
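Since the original goal was a boxplot, the pivoted frame can be plotted directly and the NaNs are simply ignored; a minimal sketch, assuming new_df holds the pivoted result from above and matplotlib is installed:
import matplotlib.pyplot as plt

new_df.plot(kind='box')       # one box per year column, NaN entries are skipped
plt.ylabel('Money awarded')   # hypothetical axis label
plt.show()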
