Selection over different columns after a groupby - python

I am new to pandas, so please treat this question with patience.
I have a DataFrame with year, state and population data collected over many years and across many states.
For each year, I want to find the max population and the corresponding state, for example:
1995 Alabama xx; 1996 New York yy; 1997 Utah zz
I did a groupby and got the population for all the states in each year; how do I iterate over all the years?
state_yearwise = df.groupby(["Year", "State"])["Pop"].max()
state_yearwise.head(10)
1990  Alabama         22.5
      Arizona         29.4
      Arkansas        16.2
      California      34.1
...
2016  South Dakota    14.1
      Tennessee       10.2
      Texas           17.4
      Utah            16.1
Now I did
df.loc[df["Pop"] == df["Pop"].max(), ["Year", "State", "Pop"]]
1992 Colorado 54.1
which gives me only one year and the max over all years and states.
What I want is, per year, the state that had the max population.
Suggestions?

You can use transform to broadcast each year's maximum back to every row, and compare it with pop to get a boolean mask of the rows that hold that maximum:
idx = df.groupby(['year'])['pop'].transform('max') == df['pop']
Now you can index the df using idx
df[idx]
You get
pop state year
2 210 B 2000
3 200 B 2001
For the other dataframe that you updated
Year State County Pop
0 2015 Mississippi Panola 6.4
1 2015 Mississippi Newton 6.7
2 2015 Mississippi Newton 6.7
3 2015 Utah Monroe 12.1
4 2013 Alabama Newton 10.4
5 2013 Alabama Georgi 4.2
idx = df.groupby(['Year'])['Pop'].transform('max') == df['Pop']
df[idx]
You get
Year State County Pop
3 2015 Utah Monroe 12.1
4 2013 Alabama Newton 10.4
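For reference, a minimal self-contained sketch of this approach (the frame below reproduces the updated data from the question):
import pandas as pd

# sample data mirroring the updated frame; column names are taken from the question
df = pd.DataFrame({
    "Year":   [2015, 2015, 2015, 2015, 2013, 2013],
    "State":  ["Mississippi", "Mississippi", "Mississippi", "Utah", "Alabama", "Alabama"],
    "County": ["Panola", "Newton", "Newton", "Monroe", "Newton", "Georgi"],
    "Pop":    [6.4, 6.7, 6.7, 12.1, 10.4, 4.2],
})

# boolean mask: True where a row's Pop equals the maximum Pop of its Year group
idx = df.groupby("Year")["Pop"].transform("max") == df["Pop"]
print(df[idx])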

Is this what you want:
df = pd.DataFrame([{'state': 'A', 'year': 2000, 'pop': 100},
                   {'state': 'A', 'year': 2001, 'pop': 110},
                   {'state': 'B', 'year': 2000, 'pop': 210},
                   {'state': 'B', 'year': 2001, 'pop': 200}])
maxpop = df.groupby("state", as_index=False)["pop"].max()
pd.merge(maxpop, df, how='inner')
I see for df:
pop state year
0 100 A 2000
1 110 A 2001
2 210 B 2000
3 200 B 2001
And for the final result:
state pop year
0 A 110 2001
1 B 210 2000
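The same merge idea can be applied per year for the original question (the column names Year, State and Pop are assumed from there):
import pandas as pd

# small sample with one yearly maximum per state (assumed column names)
df = pd.DataFrame({
    "Year":  [2015, 2015, 2013, 2013],
    "State": ["Mississippi", "Utah", "Alabama", "Alabama"],
    "Pop":   [6.7, 12.1, 10.4, 4.2],
})

maxpop = df.groupby("Year", as_index=False)["Pop"].max()
# merging on the shared columns (Year, Pop) brings back the State for each yearly max;
# a tie within a year would return more than one row for that year
print(pd.merge(maxpop, df, how="inner"))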

Why not get rid of the groupby? Use sort_values and drop_duplicates:
df.sort_values(['state','pop']).drop_duplicates('state',keep='last')
Out[164]:
pop state year
1 110 A 2001
2 210 B 2000
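Applied to the per-year question (again assuming columns named Year, State and Pop), a small self-contained sketch:
import pandas as pd

df = pd.DataFrame({
    "Year":  [2015, 2015, 2015, 2013, 2013],
    "State": ["Mississippi", "Mississippi", "Utah", "Alabama", "Alabama"],
    "Pop":   [6.4, 6.7, 12.1, 10.4, 4.2],
})

# sort so the largest Pop comes last within each Year, then keep only that last row
print(df.sort_values(["Year", "Pop"]).drop_duplicates("Year", keep="last"))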

Related

Python summing selected values in a column that match given condition

Here's the data after the preliminary data cleaning.
year  country  employees
2001  US               9
2001  Canada          81
2001  France          22
2001  Japan           31
2001  Chile            7
2001  Mexico          15
2001  Total          165
2002  US               5
2002  Canada          80
2002  France          20
2002  Japan           30
2002  Egypt           35
2002  Total          170
...   ...            ...
2010  US              32
...   ...            ...
What I want to get is the table below, which sums up all countries except US, Canada, France, and Japan into 'Others'. The list of countries varies every year from 2001 to 2010, so I was thinking of a for loop with an if condition to loop over every year.
year  country  employees
2001  US               9
2001  Canada          81
2001  France          22
2001  Japan           31
2001  Others          22
2001  Total          165
2002  US               5
2002  Canada          80
2002  France          20
2002  Japan           30
2002  Others          35
2002  Total          170
Any leads would be greatly appreciated!
You may consider dropping Total from your dataframe.
However, as stated, your question can be solved by using Series.where to map away values that you don't recognize:
country = df["country"].where(df["country"].isin(["US", "Canada", "France", "Japan", "Total"]), "Others")
df.groupby([df["year"], country]).sum(numeric_only=True)

How to add null value rows into pandas dataframe for missing years in a multi-line chart plot

I am building a chart from a dataframe with a series of yearly values for six countries. This table is created by an SQL query and then passed to pandas with the read_sql command...
country date value
0 CA 2000 123
1 CA 2001 125
2 US 1999 223
3 US 2000 235
4 US 2001 344
5 US 2002 355
...
Unfortunately, not every year has a value for each country; nevertheless, the chart tool requires each country to have the same number of years in the dataframe. Years that have no value need a NaN (null) row added.
In the end, I want the pandas dataframe to look as follows for all six countries....
country date value
0 CA 1999 NaN
1 CA 2000 123
2 CA 2001 125
3 CA 2002 NaN
4 US 1999 223
5 US 2000 235
6 US 2001 344
7 US 2002 355
8 DE 1999 NaN
9 DE 2000 NaN
10 DE 2001 423
11 DE 2002 326
...
Are there any tools or shortcuts for determining min-max dates and then ensuring a new nan row is created if needed?
Use the DataFrame.unstack with DataFrame.stack trick:
df = df.set_index(['country','date']).unstack().stack(dropna=False).reset_index()
print (df)
country date value
0 CA 1999 NaN
1 CA 2000 123.0
2 CA 2001 125.0
3 CA 2002 NaN
4 US 1999 223.0
5 US 2000 235.0
6 US 2001 344.0
7 US 2002 355.0
Another idea with DataFrame.reindex:
mux = pd.MultiIndex.from_product([df['country'].unique(),
                                  range(df['date'].min(), df['date'].max() + 1)],
                                 names=['country', 'date'])
df = df.set_index(['country','date']).reindex(mux).reset_index()
print (df)
country date value
0 CA 1999 NaN
1 CA 2000 123.0
2 CA 2001 125.0
3 CA 2002 NaN
4 US 1999 223.0
5 US 2000 235.0
6 US 2001 344.0
7 US 2002 355.0
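For completeness, a self-contained sketch of the first approach using the rows shown in the question:
import pandas as pd

df = pd.DataFrame({
    "country": ["CA", "CA", "US", "US", "US", "US"],
    "date":    [2000, 2001, 1999, 2000, 2001, 2002],
    "value":   [123, 125, 223, 235, 344, 355],
})

# unstack pivots date into columns (missing combinations become NaN),
# stack(dropna=False) pivots back while keeping those NaN rows;
# newer pandas versions may emit a deprecation warning for the dropna argument
out = df.set_index(['country', 'date']).unstack().stack(dropna=False).reset_index()
print(out)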

How to do a Group By and get a Percent Change of Revenue

I'm trying to do a group by and calculate the percentage change of revenue. Here is my data frame:
# Import pandas library
import pandas as pd
# initialize list of lists
data = [['1177 AVENUE OF THE AMERICAS', 2020, 10000],
        ['1177 AVENUE OF THE AMERICAS', 2019, 25000],
        ['1177 AVENUE OF THE AMERICAS', 2018, 5000],
        ['500 5th AVENUE', 2020, 30000],
        ['500 5th AVENUE', 2019, 5000],
        ['500 5th AVENUE', 2018, 4000],
        ['5 45th ST', 2018, 9000]]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['site_name', 'year', 'revenue'])
df.sort_values(['site_name','year'], inplace = True, ascending=[False, False])
# print dataframe.
df
I tried this:
df['Percent_Change'] = df.revenue.pct_change()
df
It gives me this:
site_name year revenue Percent_Change
3 500 5th AVENUE 2020 30000 NaN
4 500 5th AVENUE 2019 5000 -0.833333
5 500 5th AVENUE 2018 4000 -0.200000
6 5 45th ST 2018 9000 1.250000
0 1177 AVENUE OF THE AMERICAS 2020 10000 0.111111
1 1177 AVENUE OF THE AMERICAS 2019 25000 1.500000
2 1177 AVENUE OF THE AMERICAS 2018 5000 -0.800000
I also tried this:
df['Percent_Change'] = df.groupby(['site_name','year'])['revenue'].apply(lambda x: x.div(x.iloc[0]).subtract(1).mul(100))
df
It gives me this:
site_name year revenue Percent_Change
3 500 5th AVENUE 2020 30000 0.0
4 500 5th AVENUE 2019 5000 0.0
5 500 5th AVENUE 2018 4000 0.0
6 5 45th ST 2018 9000 0.0
0 1177 AVENUE OF THE AMERICAS 2020 10000 0.0
1 1177 AVENUE OF THE AMERICAS 2019 25000 0.0
2 1177 AVENUE OF THE AMERICAS 2018 5000 0.0
The tricky part is to get Percent_Change to reset when the site_name changes, so that the percent change is computed within each site_name group.
Remove year from the groupby:
df.groupby(['site_name'])['revenue'].apply(lambda x: x.div(x.iloc[0]).subtract(1).mul(100))
Also, we usually do this with transform:
s = df.groupby(['site_name'])['revenue'].transform('first')
df['Percent_Change'] = (df['revenue']/s-1)*100
You are grouping by both 'site_name' and 'year', hence the problem. I tried the code after removing 'year' and it gave the desired result.
df['Percent_Change'] = df.groupby(['site_name'])['revenue'].apply(lambda x: x.div(x.iloc[0]).subtract(1).mul(100))
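Putting the transform idea together with the data from the question, a minimal self-contained sketch:
import pandas as pd

data = [['1177 AVENUE OF THE AMERICAS', 2020, 10000],
        ['1177 AVENUE OF THE AMERICAS', 2019, 25000],
        ['1177 AVENUE OF THE AMERICAS', 2018, 5000],
        ['500 5th AVENUE', 2020, 30000],
        ['500 5th AVENUE', 2019, 5000],
        ['500 5th AVENUE', 2018, 4000],
        ['5 45th ST', 2018, 9000]]
df = pd.DataFrame(data, columns=['site_name', 'year', 'revenue'])
df.sort_values(['site_name', 'year'], inplace=True, ascending=[False, False])

# broadcast the first revenue of each site_name group back to every row,
# then compute the change relative to that first value
s = df.groupby('site_name')['revenue'].transform('first')
df['Percent_Change'] = (df['revenue'] / s - 1) * 100
print(df)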

Python pandas str.extract year information from unclean column

I have a DataFrame with over 111K rows. I'm trying to extract year information (19xx, 20xx) from the unclean column Date and fill the year into a new Result column; some rows in the Date column contain Chinese/English words.
df.Date.str.extract('20\d{2}') | df.Date.str.extract('19\d{2}')
I used str.extract() to match and extract the year but I got the ValueError: pattern contains no capture groups message. How can I get the year information and fill it into a new Result column?
Rating Date
7.8 (June 22, 2000)
8.0 01 April, 1997
8.3 01 December, 1988
7.7 01 November, 2005
7.9 UMl Reprint University Illinois 1966 Ed
7.7 出版日期:2008-06
7.3 出版时间:2009.04
7.7 台北 : 橡樹林文化, 2006.
7.0 机械工业出版社; 第1版 (2014年11月13日)
8.1 民国57年(1968)
7.8 民国79 [1990]
8.9 2010-09-13
9.3 01 (2008)
8.8 1998年4月第11次印刷
7.9 2000
7.3 2004
Sample dataframe:
Date
0 2000
1 1998年4月第11次印刷
2 01 November, 2005
3 出版日期:2008-06
4 (June 22, 2000)
You can also do it as a one liner:
df['Year'] = df.Date.str.extract(r'(19\d{2}|20\d{2})')
Output:
Date Year
2000 2000
1998年4月第11次印刷 1998
01 November, 2005 2005
出版日期:2008-06 2008
(June 22, 2000) 2000
The error says the regex must have at least one capturing group, that is, a sequence between a pair of parentheses.
In the solution I propose, I added a capturing group and two non-capturing ones. As you said, the extracted data is then inserted into the Result column.
>>> df['Result'] = df.Date.str.extract(r'((?:19\d{2})|(?:20\d{2}))')
Rating Date Result
0 7.8 (June 22, 2000) 2000
1 8.0 01 April, 1997 1997
2 8.3 01 December, 1988 1988
3 7.7 01 November, 2005 2005
4 7.9 UMl Reprint University Illinois 1966 Ed 1966
5 7.7 出版日期:2008-06 2008
6 7.3 出版时间:2009.04 2009
7 7.7 台北 : 橡樹林文化, 2006. 2006
8 7.0 机械工业出版社; 第1版 (2014年11月13... 2014
9 8.1 民国57年(1968) 1968
10 7.8 民国79 [1990] 1990
11 8.9 2010-09-13 2010
12 9.3 01 (2008) 2008
13 8.8 1998年4月第11次印刷 1998
14 7.9 2000 2000
15 7.3 None NaN
The below should do the job for you in the given case.
Just an example dataset:
>>> df
Date
0 2000
1 1998年4月第11次印刷
2 01 November, 2005
3 出版日期:2008-06
4 (June 22, 2000)
Solution:
>>> df.Date.str.extract(r'(\d{4})', expand=False)
0 2000
1 1998
2 2005
3 2008
4 2000
Or
>>> df['Year'] = df.Date.str.extract(r'(\d{4})', expand=False)
>>> df
Date Year
0 2000 2000
1 1998年4月第11次印刷 1998
2 01 November, 2005 2005
3 出版日期:2008-06 2008
4 (June 22, 2000) 2000
Another trick using assign, assigning the values back to a new column Year:
>>> df = df.assign(Year = df.Date.str.extract(r'(\d{4})', expand=False))
>>> df
Date Year
0 2000 2000
1 1998年4月第11次印刷 1998
2 01 November, 2005 2005
3 出版日期:2008-06 2008
4 (June 22, 2000) 2000
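If the year is needed as a number rather than a string, a small follow-up sketch; pd.to_numeric with errors='coerce' leaves rows without a match as NaN:
# extract the year as a string, then convert; unmatched rows become NaN
df['Result'] = df['Date'].str.extract(r'(19\d{2}|20\d{2})', expand=False)
df['Result'] = pd.to_numeric(df['Result'], errors='coerce')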

pandas remove observations depending on multi-index level value

I have a multi-index data frame with levels 'id' and 'year':
         value
id year
10 2001    100
   2002    200
11 2001    110
12 2001    200
   2002    300
13 2002    210
I want to keep the ids that have values for both years 2001 and 2002. This means I want to obtain:
         value
id year
10 2001    100
   2002    200
12 2001    200
   2002    300
I know that df.loc[df.index.get_level_values('year') == 2002] works but I cannot extend that to account for both 2001 and 2002.
Thanks in advance.
How about using groupby and filter:
df.groupby(level=0).filter(
    lambda df: np.in1d([2001, 2002], df.index.get_level_values(1)).all()
)
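A self-contained sketch of that filter on the frame from the question (np.in1d is the older name for np.isin):
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"value": [100, 200, 110, 200, 300, 210]},
    index=pd.MultiIndex.from_tuples(
        [(10, 2001), (10, 2002), (11, 2001), (12, 2001), (12, 2002), (13, 2002)],
        names=["id", "year"],
    ),
)

# keep only the ids whose year level contains both 2001 and 2002
out = df.groupby(level=0).filter(
    lambda g: np.in1d([2001, 2002], g.index.get_level_values("year")).all()
)
print(out)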
