Pandas groupby and sum not giving correct values - python

I would like to group my data by country then by year and sum up the value columns using pandas. Currently I am reading in the csv file and using the following:
data_cleaned= df.groupby(['Country', 'year'], as_index=False).sum()
Here is a sample of my dataset:
Country year value
Angola 2009 0
Angola 2009 0
Angola 2010 0
Angola 2010 0
Angola 2010 0
Angola 2010 0
Angola 2011 0
Angola 2011 0
Angola 2011 0
Angola 2011 0
Angola 2012 118
Angola 2012 0
Angola 2012 0
Angola 2012 0
Angola 2013 0
Angola 2013 0
Angola 2013 0
Angola 2013 0
Angola 2014 0
Angola 2014 0
Angola 2014 0
Angola 2014 0
Angola 2015 0
Angola 2015 0
Angola 2015 0
Angola 2015 0
Angola 2016 0
Angola 2016 0
Angola 2016 0
Angola 2016 0
Angola 2017 0
Australia 2009 0
Australia 2009 14
Australia 2009 0
Australia 2009 12
Australia 2010 0
Australia 2010 0
Australia 2010 54
Australia 2010 6
Australia 2011 0
Australia 2011 4
Australia 2011 17
Australia 2011 13
Australia 2012 8
Australia 2012 2
Australia 2012 4
Australia 2012 105
Australia 2013 0
Australia 2013 5
Australia 2013 0
Australia 2013 0
Australia 2014 0
Australia 2014 0
Australia 2014 0
Australia 2014 0
Australia 2015 0
Australia 2015 0
Australia 2015 0
Australia 2015 0
Australia 2016 0
Australia 2016 0
Australia 2016 0
Australia 2016 0
Australia 2017 0
But I get the following results:
Partner Country year value
0 Angola 2009 0.00
1 Angola 2010 0.00
2 Angola 2011 0.00
3 Angola 2012 86,280.00
4 Angola 2013 0.00
5 Angola 2014 0.00
6 Angola 2015 0.00
7 Angola 2016 0.00
8 Angola 2017 0.00
9 Australia 2009 54,879.00
10 Australia 2010 67,899.00
11 Australia 2011 50,965.00
12 Australia 2012 332,128.00
13 Australia 2013 16,515.00
14 Australia 2014 0.00
15 Australia 2015 0.00
16 Australia 2016 0.00
17 Australia 2017 0.00
Which is obviously wrong since Angola only has one non-zero value and it's in 2012, which is the correct year to have a value but I'm expecting 118 instead of 86,280.00. Could someone maybe point out what I am doing wrong and how I can correctly sum the value column based on the Country and year columns?
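As a quick check, the same groupby call can be run on a few rows copied from the sample above. This is only a sketch: the small frame below is hand-typed from the 2012 rows of the sample, and selecting the `value` column explicitly is an added precaution, not something the original call does. If it returns 118 for Angola, the call itself is behaving as intended, and the discrepancy more likely comes from rows in the full CSV that the sample does not show (the result has a `Partner Country` column, which hints that the file holds more than the three columns listed).

```python
import pandas as pd

# Hand-typed subset of the sample: the 2012 rows for both countries.
df = pd.DataFrame({
    "Country": ["Angola"] * 4 + ["Australia"] * 4,
    "year": [2012] * 8,
    "value": [118, 0, 0, 0, 8, 2, 4, 105],
})

# Same grouping as in the question, restricted to the column being summed.
data_cleaned = df.groupby(["Country", "year"], as_index=False)["value"].sum()
print(data_cleaned)

# Inspecting the raw rows behind one suspicious total shows what was actually added up.
print(df[(df["Country"] == "Angola") & (df["year"] == 2012)])
```

Running the second `print` against the full DataFrame should reveal whether more rows than expected share the same Country and year.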

Related

Python pandas Dataframe column to rows manipulation [duplicate]

This question already has answers here:
Pandas Melt Function
(2 answers)
Closed 1 year ago.
I'm trying to transpose a few columns while keeping the other columns. I'm having a hard time with pivot codes or transpose codes as it doesn't really give me the output I need.
Can anyone help?
I have this data frame:
EmpID Goal week 1 week 2 week 3 week 4
1 556 54 33 24 54
2 342 32 32 56 43
3 534 43 65 64 21
4 244 45 87 5 22
My expected dataframe output is:
EmpID Goal Weeks Actual
1 556 week 1 54
1 556 week 2 33
1 556 week 3 24
1 556 week 4 54
and so on until the full employee IDs are listed..
Something like this.
# Python - melt DF
import pandas as pd
d = {'Country Code': [1960, 1961, 1962, 1963, 1964, 1965, 1966],
'ABW': [2.615300, 2.734390, 2.678430, 2.929920, 2.963250, 3.060540, 4.349760],
'AFG': [0.249760, 0.218480, 0.210840, 0.217240, 0.211410, 0.209910, 0.671330],
'ALB': ['NaN', 'NaN', 'NaN', 'NaN', 'NaN', 'NaN', 1.12214]}
df = pd.DataFrame(data=d)
print(df)
df1 = (df.melt(['Country Code'], var_name='Year', value_name='Econometric_Metric')
         .sort_values(['Country Code','Year'])
         .reset_index(drop=True))
print(df1)
df2 = (df.set_index(['Country Code'])
         .stack(dropna=False)
         .reset_index(name='Econometric_Metric')
         .rename(columns={'level_1':'Year'}))
print(df2)
# BEFORE
ABW AFG ALB Country Code
0 2.61530 0.24976 NaN 1960
1 2.73439 0.21848 NaN 1961
2 2.67843 0.21084 NaN 1962
3 2.92992 0.21724 NaN 1963
4 2.96325 0.21141 NaN 1964
5 3.06054 0.20991 NaN 1965
6 4.34976 0.67133 1.12214 1966
# AFTER
Country Code Year Econometric_Metric
0 1960 ABW 2.6153
1 1960 AFG 0.24976
2 1960 ALB NaN
3 1961 ABW 2.73439
4 1961 AFG 0.21848
5 1961 ALB NaN
6 1962 ABW 2.67843
7 1962 AFG 0.21084
8 1962 ALB NaN
9 1963 ABW 2.92992
10 1963 AFG 0.21724
11 1963 ALB NaN
12 1964 ABW 2.96325
13 1964 AFG 0.21141
14 1964 ALB NaN
15 1965 ABW 3.06054
16 1965 AFG 0.20991
17 1965 ALB NaN
18 1966 ABW 4.34976
19 1966 AFG 0.67133
20 1966 ALB 1.12214
Country Code Year Econometric_Metric
0 1960 ABW 2.6153
1 1960 AFG 0.24976
2 1960 ALB NaN
3 1961 ABW 2.73439
4 1961 AFG 0.21848
5 1961 ALB NaN
6 1962 ABW 2.67843
7 1962 AFG 0.21084
8 1962 ALB NaN
9 1963 ABW 2.92992
10 1963 AFG 0.21724
11 1963 ALB NaN
12 1964 ABW 2.96325
13 1964 AFG 0.21141
14 1964 ALB NaN
15 1965 ABW 3.06054
16 1965 AFG 0.20991
17 1965 ALB NaN
18 1966 ABW 4.34976
19 1966 AFG 0.67133
20 1966 ALB 1.12214
Also, take a look at the link below, for more info.
https://www.dataindependent.com/pandas/pandas-melt/
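Applied to the employee data from the question, the same melt pattern might look like the sketch below. The frame is rebuilt by hand from the table in the question, and the variable name `long_df` is just an illustrative choice.

```python
import pandas as pd

# Data typed in from the question's table.
df = pd.DataFrame({
    "EmpID": [1, 2, 3, 4],
    "Goal": [556, 342, 534, 244],
    "week 1": [54, 32, 43, 45],
    "week 2": [33, 32, 65, 87],
    "week 3": [24, 56, 64, 5],
    "week 4": [54, 43, 21, 22],
})

# Keep EmpID and Goal as identifiers; turn the week columns into rows.
long_df = (
    df.melt(id_vars=["EmpID", "Goal"], var_name="Weeks", value_name="Actual")
      .sort_values(["EmpID", "Weeks"])
      .reset_index(drop=True)
)
print(long_df)
```

Sorting by `EmpID` and then `Weeks` groups each employee's four weeks together, which matches the layout of the expected output.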

Applying rolling median across row for pandas dataframe

I would like to apply a rolling median to replace NaN values in the following dataframe, with a window size of 3:
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
17 366000.0 278000.0 330000.0 NaN 434000.0 470600.0 433000.0 456000.0 556300.0 580200.0 635300.0 690600.0 800000.0 NaN 821500.0 ... 850800.0 905000.0 947500.0 1016500.0 1043900.0 1112800.0 1281900.0 1312700.0 1422000.0 1526900.0 1580000.0 1599000.0 1580000.0 NaN NaN
However, the pandas rolling function seems to work down columns and not along a row. How can I fix this? Also, the solution should NOT change any of the non-NaN values in that row.
First compute the rolling medians by using rolling() with axis=1 (row-wise), min_periods=0 (to handle NaN), and closed='both' (otherwise left edge gets excluded).
Then replace only the NaN entries with these medians by using fillna().
medians = df.rolling(3, min_periods=0, closed='both', axis=1).median()
df = df.fillna(medians)
# 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ... 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
# 17 366000.0 278000.0 330000.0 330000.0 434000.0 470600.0 433000.0 456000.0 556300.0 580200.0 ... 1112800.0 1281900.0 1312700.0 1422000.0 1526900.0 1580000.0 1599000.0 1580000.0 1580000.0 1589500.0
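Newer pandas releases deprecate the `axis` argument of `rolling()`, so a variant that avoids it may be useful. The sketch below transposes the frame, rolls down the (now vertical) years, and transposes back; the one-row frame with columns `a` to `e` is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical single-row frame with two gaps.
df = pd.DataFrame([[1.0, 3.0, np.nan, 5.0, np.nan]],
                  columns=["a", "b", "c", "d", "e"])

# Roll over the transposed frame instead of passing axis=1.
medians = df.T.rolling(3, min_periods=0, closed="both").median().T

# Only NaN cells are replaced; existing values are left untouched.
filled = df.fillna(medians)
print(filled)
```

Because `fillna` only writes into missing cells, the non-NaN values in the row stay exactly as they were.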

Pandas DataFrame: Dropping rows after meeting conditions in columns

I have a large panel data in a pandas DataFrame:
import pandas as pd
df = pd.read_csv('Qs_example_data.csv')
df.head()
ID Year DOB status YOD
223725 1991 1975.0 No 2021
223725 1992 1975.0 No 2021
223725 1993 1975.0 No 2021
223725 1994 1975.0 No 2021
223725 1995 1975.0 No 2021
I want to drop the rows based on the following condition:
For each ID, once a row has a YOD value equal to its Year value, or has Yes in the status column, all rows after that row for that ID should be dropped.
For example, in the DataFrame, ID 68084329 has the value 2012 in both the Year and YOD columns on row 221930. All rows after 221930 for 68084329 should be dropped.
df.loc[df['ID'] == 68084329]
ID Year DOB status YOD
221910 68084329 1991 1942.0 No 2012
221911 68084329 1992 1942.0 No 2012
221912 68084329 1993 1942.0 No 2012
221913 68084329 1994 1942.0 No 2012
221914 68084329 1995 1942.0 No 2012
221915 68084329 1996 1942.0 No 2012
221916 68084329 1997 1942.0 No 2012
221917 68084329 1998 1942.0 No 2012
221918 68084329 1999 1942.0 No 2012
221919 68084329 2000 1942.0 No 2012
221920 68084329 2001 1942.0 No 2012
221921 68084329 2002 1942.0 No 2012
221922 68084329 2003 1942.0 No 2012
221923 68084329 2004 1942.0 No 2012
221924 68084329 2005 1942.0 No 2012
221925 68084329 2006 1942.0 No 2012
221926 68084329 2007 1942.0 No 2012
221927 68084329 2008 1942.0 No 2012
221928 68084329 2010 1942.0 No 2012
221929 68084329 2011 1942.0 No 2012
221930 68084329 2012 1942.0 Yes 2012
221931 68084329 2013 1942.0 No 2012
221932 68084329 2014 1942.0 No 2012
221933 68084329 2015 1942.0 No 2012
221934 68084329 2016 1942.0 No 2012
221935 68084329 2017 1942.0 No 2012
I have a lot of IDs that have rows which need to be dropped in accordance with the above condition. How do I do this?
The following code should also work:
result=df[0:0]
ids=[]
for i in df.ID:
    if i not in ids:
        ids.append(i)
for k in ids:
    temp=df[df.ID==k]
    for j in range(len(temp)):
        result=pd.concat([result, temp.iloc[j:j+1, :]])
        if temp.iloc[j, :]['status']=='Yes':
            break
print(result)
This should do. From your wording, it wasn't clear whether you need to "drop all the rows after you encounter a Yes for that ID", or "just the rows you encounter a Yes in". I assumed that you need to "drop all the rows after you encounter a Yes for that ID".
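import pandas as pd
def __get_nos__(df):
    # Keep the rows before the first 'Yes' in this group.
    # Caveat: if a group contains no 'Yes', argmin() returns 0 and the group comes back empty.
    return df.iloc[0:(df['Status'] != 'Yes').values.argmin(), :]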
df = pd.DataFrame()
df['ID'] = [12345678]*10 + [13579]*10
df['Year'] = list(range(2000, 2010))*2
df['DOB'] = list(range(2000, 2010))*2
df['YOD'] = list(range(2000, 2010))*2
df['Status'] = ['No']*5 + ['Yes']*5 + ['No']*7 + ['Yes']*3
""" df
ID Year DOB YOD Status
0 12345678 2000 2000 2000 No
1 12345678 2001 2001 2001 No
2 12345678 2002 2002 2002 No
3 12345678 2003 2003 2003 No
4 12345678 2004 2004 2004 No
5 12345678 2005 2005 2005 Yes
6 12345678 2006 2006 2006 Yes
7 12345678 2007 2007 2007 Yes
8 12345678 2008 2008 2008 Yes
9 12345678 2009 2009 2009 Yes
10 13579 2000 2000 2000 No
11 13579 2001 2001 2001 No
12 13579 2002 2002 2002 No
13 13579 2003 2003 2003 No
14 13579 2004 2004 2004 No
15 13579 2005 2005 2005 No
16 13579 2006 2006 2006 No
17 13579 2007 2007 2007 Yes
18 13579 2008 2008 2008 Yes
19 13579 2009 2009 2009 Yes
"""
df.groupby('ID').apply(lambda x: __get_nos__(x)).reset_index(drop=True)
""" Output
ID Year DOB YOD Status
0 13579 2000 2000 2000 No
1 13579 2001 2001 2001 No
2 13579 2002 2002 2002 No
3 13579 2003 2003 2003 No
4 13579 2004 2004 2004 No
5 13579 2005 2005 2005 No
6 13579 2006 2006 2006 No
7 12345678 2000 2000 2000 No
8 12345678 2001 2001 2001 No
9 12345678 2002 2002 2002 No
10 12345678 2003 2003 2003 No
11 12345678 2004 2004 2004 No
"""
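A loop-free variant is sketched below. It differs from the answers above in two assumed ways: it also treats `Year == YOD` as a stop condition, and it keeps IDs that never hit a stop condition. The small frame is invented for illustration and uses the question's column names (`ID`, `Year`, `YOD`, `status`); rows are assumed to be sorted by year within each ID.

```python
import pandas as pd

# Hypothetical data: ID 1 stops in 2012, ID 2 never stops.
df = pd.DataFrame({
    "ID": [1] * 5 + [2] * 4,
    "Year": [2010, 2011, 2012, 2013, 2014, 2010, 2011, 2012, 2013],
    "YOD": [2012] * 5 + [2021] * 4,
    "status": ["No", "No", "Yes", "No", "No", "No", "No", "No", "No"],
})

# A row is a stop row if either condition from the question holds.
stop = ((df["Year"] == df["YOD"]) | (df["status"] == "Yes")).astype(int)

# Count stop rows seen strictly before each row, within its ID.
seen_before = stop.groupby(df["ID"]).cumsum() - stop

# Keep rows up to and including the first stop row of each ID.
result = df[seen_before == 0]
print(result)
```

Subtracting `stop` from the running count is what keeps the stop row itself, matching the question's example where only the rows after 221930 are dropped.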

Loop only takes last value

I have a dataFrame with country-specific population for each year and a pandas Series with the world population for each year.
This is the Series I am using:
pop_tot = df3.groupby('Year')['population'].sum()
Year
1990 4.575442e+09
1991 4.659075e+09
1992 4.699921e+09
1993 4.795129e+09
1994 4.862547e+09
1995 4.949902e+09
... ...
2017 6.837429e+09
and this is the DataFrame I am using
Country Year HDI population
0 Afghanistan 1990 NaN 1.22491e+07
1 Albania 1990 0.645 3.28654e+06
2 Algeria 1990 0.577 2.59124e+07
3 Andorra 1990 NaN 54509
4 Angola 1990 NaN 1.21714e+07
... ... ... ... ...
4096 Uzbekistan 2017 0.71 3.23872e+07
4097 Vanuatu 2017 0.603 276244
4098 Zambia 2017 0.588 1.70941e+07
4099 Zimbabwe 2017 0.535 1.65299e+07
I want to calculate the proportion of the world's population that the population of that country represents for each year, so I loop over the Series and the DataFrame as follows:
j = 0
for i in range(len(df3)):
    if df3.iloc[i,1]==pop_tot.index[j]:
        df3['pop_tot']=pop_tot[j] #Sanity check
        df3['weighted']=df3['population']/pop_tot[j]*df3.iloc[i,2]
    else:
        j=j+1
However, the DataFrame that I get in return is not the expected one. I end up dividing all the values by the total population of 2017, which gives proportions that are not correct for that year (i.e. for these first rows, pop_tot should be 4.575442e+09, as it corresponds to 1990 according to the Series above, and not 6.837429e+09, which corresponds to 2017).
Country Year HDI population pop_tot weighted
0 Albania 1990 0.645 3.28654e+06 6.837429e+09 0.000257158
1 Algeria 1990 0.577 2.59124e+07 6.837429e+09 0.00202753
2 Argentina 1990 0.704 3.27297e+07 6.837429e+09 0.00256096
I can't see however what's the mistake in the loop.
Thanks in advance.
You don't need a loop: you can use groupby.transform to create the column pop_tot in df3 directly. Then, for the column weighted, just do a column operation, such as:
df3['pop_tot'] = df3.groupby('Year')['population'].transform('sum')
df3['weighted'] = df3['population']/df3['pop_tot']
As @roganjosh pointed out, the problem with your method is that you replace the whole columns pop_tot and weighted every time the if condition is met. At the last iteration where the condition is met, the year probably being 2017, the entire pop_tot column is set to the 2017 value and weighted is calculated with that value as well.
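Since the per-year totals already exist in the Series pop_tot, another option is to look each row's year up in that Series with `Series.map`. The sketch below uses a tiny invented frame with countries "A" and "B"; only the column names come from the question.

```python
import pandas as pd

# Hypothetical two-country, two-year data set.
df3 = pd.DataFrame({
    "Country": ["A", "B", "A", "B"],
    "Year": [1990, 1990, 1991, 1991],
    "population": [10.0, 30.0, 20.0, 60.0],
})

# Per-year totals, built the same way as in the question.
pop_tot = df3.groupby("Year")["population"].sum()

# Look up each row's yearly total by its Year, then divide column by column.
df3["pop_tot"] = df3["Year"].map(pop_tot)
df3["weighted"] = df3["population"] / df3["pop_tot"]
print(df3)
```

Each row now carries the total for its own year, so the shares within a year add up to 1.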
You don't have to loop; it's slower and can get complicated quickly. Use the vectorized operations in pandas and numpy instead, for example:
df['pop_tot'] = df.population.sum()
df['weighted'] = df.population / df.population.sum()
print(df)
Country Year HDI population pop_tot weighted
0 Afghanistan 1990 NaN 12249100.0 53673949.0 0.228213
1 Albania 1990 0.645 3286540.0 53673949.0 0.061232
2 Algeria 1990 0.577 25912400.0 53673949.0 0.482774
3 Andorra 1990 NaN 54509.0 53673949.0 0.001016
4 Angola 1990 NaN 12171400.0 53673949.0 0.226766
Edit after OP's comment
df['pop_tot'] = df.groupby('Year').population.transform('sum')
df['weighted'] = df.population / df['pop_tot']
print(df)
Country Year HDI population pop_tot weighted
0 Afghanistan 1990 NaN 12249100.0 53673949.0 0.228213
1 Albania 1990 0.645 3286540.0 53673949.0 0.061232
2 Algeria 1990 0.577 25912400.0 53673949.0 0.482774
3 Andorra 1990 NaN 54509.0 53673949.0 0.001016
4 Angola 1990 NaN 12171400.0 53673949.0 0.226766
note
I used the small dataset you gave as example:
Country Year HDI population
0 Afghanistan 1990 NaN 12249100.0
1 Albania 1990 0.645 3286540.0
2 Algeria 1990 0.577 25912400.0
3 Andorra 1990 NaN 54509.0
4 Angola 1990 NaN 12171400.0

pandas DataFrame .stack(dropna=False) but keeping existing combinations of levels

My data looks like this
import numpy as np
import pandas as pd
# My Data
enroll_year = np.arange(2010, 2015)
grad_year = enroll_year + 4
n_students = [[100, 100, 110, 110, np.nan]]
df = pd.DataFrame(
n_students,
columns=pd.MultiIndex.from_arrays(
[enroll_year, grad_year],
names=['enroll_year', 'grad_year']))
print(df)
# enroll_year 2010 2011 2012 2013 2014
# grad_year 2014 2015 2016 2017 2018
# 0 100 100 110 110 NaN
What I am trying to do is to stack the data, one column/index level for year of enrollment, one for year of graduation and one for the numbers of students, which should look like
# enroll_year grad_year n
# 2010 2014 100.0
# . . .
# . . .
# . . .
# 2014 2018 NaN
The data produced by .stack() is very close, but the missing record(s) is dropped,
df1 = df.stack(['enroll_year', 'grad_year'])
df1.index = df1.index.droplevel(0)
print(df1)
# enroll_year grad_year
# 2010 2014 100.0
# 2011 2015 100.0
# 2012 2016 110.0
# 2013 2017 110.0
# dtype: float64
So, .stack(dropna=False) is tried, but it will expand the index levels to all combinations of enrollment and graduation years
df2 = df.stack(['enroll_year', 'grad_year'], dropna=False)
df2.index = df2.index.droplevel(0)
print(df2)
# enroll_year grad_year
# 2010 2014 100.0
# 2015 NaN
# 2016 NaN
# 2017 NaN
# 2018 NaN
# 2011 2014 NaN
# 2015 100.0
# 2016 NaN
# 2017 NaN
# 2018 NaN
# 2012 2014 NaN
# 2015 NaN
# 2016 110.0
# 2017 NaN
# 2018 NaN
# 2013 2014 NaN
# 2015 NaN
# 2016 NaN
# 2017 110.0
# 2018 NaN
# 2014 2014 NaN
# 2015 NaN
# 2016 NaN
# 2017 NaN
# 2018 NaN
# dtype: float64
And I need to subset df2 to get my desired data set.
existing_combn = list(zip(
    df.columns.levels[0][df.columns.codes[0]],
    df.columns.levels[1][df.columns.codes[1]]))
df3 = df2.loc[existing_combn]
print(df3)
# enroll_year grad_year
# 2010 2014 100.0
# 2011 2015 100.0
# 2012 2016 110.0
# 2013 2017 110.0
# 2014 2018 NaN
# dtype: float64
Although it only adds a few more extra lines to my code, I wonder if there are any better and neater approaches.
Use unstack, wrap the result in pd.DataFrame, then call reset_index, drop the unneeded level_2 column, and rename the value column:
pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1).rename(columns={0:'n'})
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
Or:
df.unstack().reset_index(level=2, drop=True)
enroll_year grad_year
2010 2014 100.0
2011 2015 100.0
2012 2016 110.0
2013 2017 110.0
2014 2018 NaN
dtype: float64
Or:
df.unstack().reset_index(level=2, drop=True).reset_index().rename(columns={0:'n'})
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
Explanation :
print(pd.DataFrame(df.unstack()))
0
enroll_year grad_year
2010 2014 0 100.0
2011 2015 0 100.0
2012 2016 0 110.0
2013 2017 0 110.0
2014 2018 0 NaN
print(pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1))
enroll_year grad_year 0
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
print(pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1).rename(columns={0:'n'}))
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
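The same idea can be written a little more compactly, as sketched below on the question's data. Using `droplevel` and `reset_index(name=...)` here is a substitution for the `drop`/`rename` chain above, not part of the original answer, and `long_df` is an illustrative name.

```python
import numpy as np
import pandas as pd

# Data from the question.
enroll_year = np.arange(2010, 2015)
grad_year = enroll_year + 4
df = pd.DataFrame(
    [[100, 100, 110, 110, np.nan]],
    columns=pd.MultiIndex.from_arrays(
        [enroll_year, grad_year],
        names=["enroll_year", "grad_year"]))

# unstack() yields a Series indexed by (enroll_year, grad_year, row label);
# drop the row label and name the value column in one step.
long_df = df.unstack().droplevel(-1).reset_index(name="n")
print(long_df)
```

Only the column pairs that actually exist appear in the result, and the NaN entry for 2014/2018 is kept.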
