Pandas DataFrame: Dropping rows after meeting conditions in columns - python

I have a large panel data in a pandas DataFrame:
import pandas as pd
df = pd.read_csv('Qs_example_data.csv')
df.head()
ID Year DOB status YOD
223725 1991 1975.0 No 2021
223725 1992 1975.0 No 2021
223725 1993 1975.0 No 2021
223725 1994 1975.0 No 2021
223725 1995 1975.0 No 2021
I want to drop the rows based on the following condition:
For each ID, once the value in Year matches the value in YOD (equivalently, once a Yes appears in the status column for that ID), every row after that matching row should be dropped.
For example, in the DataFrame, ID 68084329 has the value 2012 in both the Year and YOD columns on row 221930 (where status is Yes). All rows after 221930 for 68084329 should be dropped.
df.loc[df['ID'] == 68084329]
ID Year DOB status YOD
221910 68084329 1991 1942.0 No 2012
221911 68084329 1992 1942.0 No 2012
221912 68084329 1993 1942.0 No 2012
221913 68084329 1994 1942.0 No 2012
221914 68084329 1995 1942.0 No 2012
221915 68084329 1996 1942.0 No 2012
221916 68084329 1997 1942.0 No 2012
221917 68084329 1998 1942.0 No 2012
221918 68084329 1999 1942.0 No 2012
221919 68084329 2000 1942.0 No 2012
221920 68084329 2001 1942.0 No 2012
221921 68084329 2002 1942.0 No 2012
221922 68084329 2003 1942.0 No 2012
221923 68084329 2004 1942.0 No 2012
221924 68084329 2005 1942.0 No 2012
221925 68084329 2006 1942.0 No 2012
221926 68084329 2007 1942.0 No 2012
221927 68084329 2008 1942.0 No 2012
221928 68084329 2010 1942.0 No 2012
221929 68084329 2011 1942.0 No 2012
221930 68084329 2012 1942.0 Yes 2012
221931 68084329 2013 1942.0 No 2012
221932 68084329 2014 1942.0 No 2012
221933 68084329 2015 1942.0 No 2012
221934 68084329 2016 1942.0 No 2012
221935 68084329 2017 1942.0 No 2012
I have a lot of IDs that have rows which need to be dropped in accordance with the above condition. How do I do this?

The following code should also work:
result = df[0:0]  # empty frame with the same columns as df

# collect the unique IDs in order of first appearance
ids = []
for i in df.ID:
    if i not in ids:
        ids.append(i)

# for each ID, keep rows up to and including the first 'Yes'
for k in ids:
    temp = df[df.ID == k]
    for j in range(len(temp)):
        result = pd.concat([result, temp.iloc[j:j + 1, :]])
        if temp.iloc[j, :]['status'] == 'Yes':
            break
print(result)
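For a large frame, a vectorized alternative avoids the per-ID Python loop. A minimal sketch, assuming rows are already sorted by Year within each ID and that status is exactly 'Yes'/'No':
stop = df['status'].eq('Yes').groupby(df['ID']).cumsum()
# rows before the first 'Yes' have stop == 0; the first 'Yes' row itself
# has stop == 1, and everything after it is dropped
result = df[(stop == 0) | ((stop == 1) & df['status'].eq('Yes'))]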

This should do it. From your wording, it wasn't clear whether you need to "drop all the rows after you encounter a Yes for that ID", or just the rows in which a Yes is encountered. I assumed the former: drop all the rows after you encounter a Yes for that ID.
import pandas as pd
def __get_nos__(df):
    # argmin finds the first False in the mask, i.e. the first 'Yes';
    # everything before it is kept (caveat: a group with no 'Yes' is dropped entirely)
    return df.iloc[0:(df['Status'] != 'Yes').values.argmin(), :]
df = pd.DataFrame()
df['ID'] = [12345678]*10 + [13579]*10
df['Year'] = list(range(2000, 2010))*2
df['DOB'] = list(range(2000, 2010))*2
df['YOD'] = list(range(2000, 2010))*2
df['Status'] = ['No']*5 + ['Yes']*5 + ['No']*7 + ['Yes']*3
""" df
ID Year DOB YOD Status
0 12345678 2000 2000 2000 No
1 12345678 2001 2001 2001 No
2 12345678 2002 2002 2002 No
3 12345678 2003 2003 2003 No
4 12345678 2004 2004 2004 No
5 12345678 2005 2005 2005 Yes
6 12345678 2006 2006 2006 Yes
7 12345678 2007 2007 2007 Yes
8 12345678 2008 2008 2008 Yes
9 12345678 2009 2009 2009 Yes
10 13579 2000 2000 2000 No
11 13579 2001 2001 2001 No
12 13579 2002 2002 2002 No
13 13579 2003 2003 2003 No
14 13579 2004 2004 2004 No
15 13579 2005 2005 2005 No
16 13579 2006 2006 2006 No
17 13579 2007 2007 2007 Yes
18 13579 2008 2008 2008 Yes
19 13579 2009 2009 2009 Yes
"""
df.groupby('ID').apply(lambda x: __get_nos__(x)).reset_index(drop=True)
""" Output
ID Year DOB YOD Status
0 13579 2000 2000 2000 No
1 13579 2001 2001 2001 No
2 13579 2002 2002 2002 No
3 13579 2003 2003 2003 No
4 13579 2004 2004 2004 No
5 13579 2005 2005 2005 No
6 13579 2006 2006 2006 No
7 12345678 2000 2000 2000 No
8 12345678 2001 2001 2001 No
9 12345678 2002 2002 2002 No
10 12345678 2003 2003 2003 No
11 12345678 2004 2004 2004 No
"""

Related

Applying rolling median across row for pandas dataframe

I would like to apply a rolling median to replace NaN values in the following dataframe, with a window size of 3:
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
17 366000.0 278000.0 330000.0 NaN 434000.0 470600.0 433000.0 456000.0 556300.0 580200.0 635300.0 690600.0 800000.0 NaN 821500.0 ... 850800.0 905000.0 947500.0 1016500.0 1043900.0 1112800.0 1281900.0 1312700.0 1422000.0 1526900.0 1580000.0 1599000.0 1580000.0 NaN NaN
However, pandas' rolling function seems to work down columns rather than along a row. How can I fix this? Also, the solution should NOT change any of the non-NaN values in that row.
First compute the rolling medians by using rolling() with axis=1 (row-wise), min_periods=0 (to handle NaNs), and closed='both' (otherwise the left edge gets excluded).
Then replace only the NaN entries with these medians by using fillna().
medians = df.rolling(3, min_periods=0, closed='both', axis=1).median()
df = df.fillna(medians)
# 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ... 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
# 17 366000.0 278000.0 330000.0 330000.0 434000.0 470600.0 433000.0 456000.0 556300.0 580200.0 ... 1112800.0 1281900.0 1312700.0 1422000.0 1526900.0 1580000.0 1599000.0 1580000.0 1580000.0 1589500.0
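Note that recent pandas releases deprecate the axis argument to rolling(); assuming such a version, the same row-wise computation can be written by transposing first:
medians = df.T.rolling(3, min_periods=0, closed='both').median().T
df = df.fillna(medians)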

pandas DataFrame .stack(dropna=False) but keeping existing combinations of levels

My data looks like this
import numpy as np
import pandas as pd
# My Data
enroll_year = np.arange(2010, 2015)
grad_year = enroll_year + 4
n_students = [[100, 100, 110, 110, np.nan]]
df = pd.DataFrame(
    n_students,
    columns=pd.MultiIndex.from_arrays(
        [enroll_year, grad_year],
        names=['enroll_year', 'grad_year']))
print(df)
# enroll_year 2010 2011 2012 2013 2014
# grad_year 2014 2015 2016 2017 2018
# 0 100 100 110 110 NaN
What I am trying to do is to stack the data, one column/index level for year of enrollment, one for year of graduation and one for the numbers of students, which should look like
# enroll_year grad_year n
# 2010 2014 100.0
# . . .
# . . .
# . . .
# 2014 2018 NaN
The data produced by .stack() is very close, but the missing record(s) are dropped:
df1 = df.stack(['enroll_year', 'grad_year'])
df1.index = df1.index.droplevel(0)
print(df1)
# enroll_year grad_year
# 2010 2014 100.0
# 2011 2015 100.0
# 2012 2016 110.0
# 2013 2017 110.0
# dtype: float64
So I tried .stack(dropna=False), but it expands the index levels to all combinations of enrollment and graduation years:
df2 = df.stack(['enroll_year', 'grad_year'], dropna=False)
df2.index = df2.index.droplevel(0)
print(df2)
# enroll_year grad_year
# 2010 2014 100.0
# 2015 NaN
# 2016 NaN
# 2017 NaN
# 2018 NaN
# 2011 2014 NaN
# 2015 100.0
# 2016 NaN
# 2017 NaN
# 2018 NaN
# 2012 2014 NaN
# 2015 NaN
# 2016 110.0
# 2017 NaN
# 2018 NaN
# 2013 2014 NaN
# 2015 NaN
# 2016 NaN
# 2017 110.0
# 2018 NaN
# 2014 2014 NaN
# 2015 NaN
# 2016 NaN
# 2017 NaN
# 2018 NaN
# dtype: float64
And I need to subset df2 to get my desired data set.
# note: in newer pandas, MultiIndex.labels has been renamed to MultiIndex.codes
existing_combn = list(zip(
    df.columns.levels[0][df.columns.labels[0]],
    df.columns.levels[1][df.columns.labels[1]]))
df3 = df2.loc[existing_combn]
print(df3)
# enroll_year grad_year
# 2010 2014 100.0
# 2011 2015 100.0
# 2012 2016 110.0
# 2013 2017 110.0
# 2014 2018 NaN
# dtype: float64
Although it only adds a few more extra lines to my code, I wonder if there are any better and neater approaches.
Use unstack, wrap the result in pd.DataFrame, then reset_index, drop the unnecessary column, and rename the value column:
pd.DataFrame(df.unstack()).reset_index().drop('level_2', axis=1).rename(columns={0: 'n'})
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
Or:
df.unstack().reset_index(level=2, drop=True)
enroll_year grad_year
2010 2014 100.0
2011 2015 100.0
2012 2016 110.0
2013 2017 110.0
2014 2018 NaN
dtype: float64
Or:
df.unstack().reset_index(level=2, drop=True).reset_index().rename(columns={0:'n'})
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
Explanation:
print(pd.DataFrame(df.unstack()))
0
enroll_year grad_year
2010 2014 0 100.0
2011 2015 0 100.0
2012 2016 0 110.0
2013 2017 0 110.0
2014 2018 0 NaN
print(pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1))
enroll_year grad_year 0
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN
print(pd.DataFrame(df.unstack()).reset_index().drop('level_2',axis=1).rename(columns={0:'n'}))
enroll_year grad_year n
0 2010 2014 100.0
1 2011 2015 100.0
2 2012 2016 110.0
3 2013 2017 110.0
4 2014 2018 NaN

Pandas groupby and sum not giving correct values

I would like to group my data by country then by year and sum up the value columns using pandas. Currently I am reading in the csv file and using the following:
data_cleaned = df.groupby(['Country', 'year'], as_index=False).sum()
Here is a sample of my dataset:
Country year value
Angola 2009 0
Angola 2009 0
Angola 2010 0
Angola 2010 0
Angola 2010 0
Angola 2010 0
Angola 2011 0
Angola 2011 0
Angola 2011 0
Angola 2011 0
Angola 2012 118
Angola 2012 0
Angola 2012 0
Angola 2012 0
Angola 2013 0
Angola 2013 0
Angola 2013 0
Angola 2013 0
Angola 2014 0
Angola 2014 0
Angola 2014 0
Angola 2014 0
Angola 2015 0
Angola 2015 0
Angola 2015 0
Angola 2015 0
Angola 2016 0
Angola 2016 0
Angola 2016 0
Angola 2016 0
Angola 2017 0
Australia 2009 0
Australia 2009 14
Australia 2009 0
Australia 2009 12
Australia 2010 0
Australia 2010 0
Australia 2010 54
Australia 2010 6
Australia 2011 0
Australia 2011 4
Australia 2011 17
Australia 2011 13
Australia 2012 8
Australia 2012 2
Australia 2012 4
Australia 2012 105
Australia 2013 0
Australia 2013 5
Australia 2013 0
Australia 2013 0
Australia 2014 0
Australia 2014 0
Australia 2014 0
Australia 2014 0
Australia 2015 0
Australia 2015 0
Australia 2015 0
Australia 2015 0
Australia 2016 0
Australia 2016 0
Australia 2016 0
Australia 2016 0
Australia 2017 0
But I get the following results:
Partner Country year value
0 Angola 2009 0.00
1 Angola 2010 0.00
2 Angola 2011 0.00
3 Angola 2012 86,280.00
4 Angola 2013 0.00
5 Angola 2014 0.00
6 Angola 2015 0.00
7 Angola 2016 0.00
8 Angola 2017 0.00
9 Australia 2009 54,879.00
10 Australia 2010 67,899.00
11 Australia 2011 50,965.00
12 Australia 2012 332,128.00
13 Australia 2013 16,515.00
14 Australia 2014 0.00
15 Australia 2015 0.00
16 Australia 2016 0.00
17 Australia 2017 0.00
This is obviously wrong: Angola has only one non-zero value, in 2012, so that is indeed the correct year to have a value, but I'm expecting 118 instead of 86,280.00. Could someone point out what I am doing wrong, and how I can correctly sum the value column by the Country and year columns?
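No answer is shown here, but for what it's worth, groupby sum does return 118 for Angola 2012 when run on the sample exactly as posted; a quick check, with a few sample rows inlined as a string:
import io
import pandas as pd

sample = io.StringIO("""Country,year,value
Angola,2012,118
Angola,2012,0
Australia,2009,14
Australia,2009,12
""")
df = pd.read_csv(sample)
print(df.groupby(['Country', 'year'], as_index=False)['value'].sum())
#      Country  year  value
# 0     Angola  2012    118
# 1  Australia  2009     26
This suggests the inflated totals come from the full CSV rather than from groupby itself (e.g. duplicated rows, or more rows than the sample shows); selecting the column explicitly with ['value'] also keeps unrelated numeric columns out of the sum.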

pandas add columns conditions with groupby and on another column values

I have a pandas.DataFrame called companysubset, shown below (the actual data is much longer).
conm fyear dvpayout industry firmycount ipodate
46078 CAESARS ENTERTAINMENT CORP 2003 0.226813 Services 22 19891213.0
46079 CAESARS ENTERTAINMENT CORP 2004 0.226813 Services 22 19891213.0
46080 CAESARS ENTERTAINMENT CORP 2005 0.226813 Services 22 19891213.0
46091 CAESARS ENTERTAINMENT CORP 2016 0.226813 Services 22 19891213.0
114620 CAESARSTONE LTD 2010 0.487543 Manufacturing 10 20120322.0
114621 CAESARSTONE LTD 2011 0.487543 Manufacturing 10 20120322.0
114622 CAESARSTONE LTD 2012 0.487543 Manufacturing 10 20120322.0
114623 CAESARSTONE LTD 2013 0.487543 Manufacturing 10 20120322.0
114624 CAESARSTONE LTD 2014 0.487543 Manufacturing 10 20120322.0
114625 CAESARSTONE LTD 2015 0.487543 Manufacturing 10 20120322.0
114626 CAESARSTONE LTD 2016 0.487543 Manufacturing 10 20120322.0
132524 CAFEPRESS INC 2010 0.000000 Retail Trade 7 20120329.0
132525 CAFEPRESS INC 2011 0.000000 Retail Trade 7 20120329.0
132526 CAFEPRESS INC 2012 -0.000000 Retail Trade 7 20120329.0
132527 CAFEPRESS INC 2013 -0.000000 Retail Trade 7 20120329.0
132528 CAFEPRESS INC 2014 -0.000000 Retail Trade 7 20120329.0
132529 CAFEPRESS INC 2015 -0.000000 Retail Trade 7 20120329.0
132530 CAFEPRESS INC 2016 -0.000000 Retail Trade 7 20120329.0
120049 CAI INTERNATIONAL INC 2005 0.000000 Services 12 20070516.0
120050 CAI INTERNATIONAL INC 2006 0.000000 Services 12 20070516.0
3896 CALAMP CORP 1999 -0.000000 Manufacturing 23 NaN
3897 CALAMP CORP 2000 0.000000 Manufacturing 23 NaN
3898 CALAMP CORP 2001 0.000000 Manufacturing 23 NaN
3899 CALAMP CORP 2002 0.000000 Manufacturing 23 NaN
21120 CALATLANTIC GROUP INC 1995 -0.133648 Construction 22 NaN
21121 CALATLANTIC GROUP INC 1996 -0.133648 Construction 22 NaN
21122 CALATLANTIC GROUP INC 1997 -0.133648 Construction 22 NaN
21123 CALATLANTIC GROUP INC 1998 -0.133648 Construction 22 NaN
21124 CALATLANTIC GROUP INC 1999 -0.133648 Construction 22 NaN
21125 CALATLANTIC GROUP INC 2000 -0.133648 Construction 22 NaN
21126 CALATLANTIC GROUP INC 2001 -0.133648 Construction 22 NaN
21127 CALATLANTIC GROUP INC 2002 -0.133648 Construction 22 NaN
21128 CALATLANTIC GROUP INC 2003 -0.133648 Construction 22 NaN
1) I want to compute the quartile of each company's dvpayout within its industry and add a column called dv indicating whether it falls in Q1, Q2, Q3 or Q4.
I came up with this code, but it does not work:
pd.cut(companysubset['dvpayout'].mean(), bins=[0,25,75,100], labels=False)
2) I want to add a column called age where an ipodate exists. The value should be the largest fyear minus the year part of ipodate (e.g. 2016 - 1989 for CAESARS ENTERTAINMENT CORP).
The results data frame I want to see is like below.
conm fyear dvpayout industry firmycount ipodate dv age
46078 CAESARS ... 2003 0.226813 Services 22 19891213.0 Q2 27
46079 CAESARS ... 2004 0.226813 Services 22 19891213.0 Q2 27
46080 CAESARS ... 2005 0.226813 Services 22 19891213.0 Q2 27
46091 CAESARS ... 2016 0.226813 Services 22 19891213.0 Q2 27
114620 CAESARSTONE LTD 2010 0.487543 Manufacturing 10 20120322.0 Q3 4
114621 CAESARSTONE LTD 2011 0.487543 Manufacturing 10 20120322.0 Q3 4
114622 CAESARSTONE LTD 2012 0.487543 Manufacturing 10 20120322.0 Q3 4
114623 CAESARSTONE LTD 2013 0.487543 Manufacturing 10 20120322.0 Q3 4
114624 CAESARSTONE LTD 2014 0.487543 Manufacturing 10 20120322.0 Q3 4
114625 CAESARSTONE LTD 2015 0.487543 Manufacturing 10 20120322.0 Q3 4
114626 CAESARSTONE LTD 2016 0.487543 Manufacturing 10 20120322.0 Q3 4
132524 CAFEPRESS INC 2010 0.000000 Retail Trade 7 20120329.0 Q1 4
132525 CAFEPRESS INC 2011 0.000000 Retail Trade 7 20120329.0 Q1 4
132526 CAFEPRESS INC 2012 -0.000000 Retail Trade 7 20120329.0 Q1 4
132527 CAFEPRESS INC 2013 -0.000000 Retail Trade 7 20120329.0 Q1 4
132528 CAFEPRESS INC 2014 -0.000000 Retail Trade 7 20120329.0 Q1 4
132529 CAFEPRESS INC 2015 -0.000000 Retail Trade 7 20120329.0 Q1 4
132530 CAFEPRESS INC 2016 -0.000000 Retail Trade 7 20120329.0 Q1 4
120049 CAI INTERNATIONAL INC 2006 0.000000 Services 12 20070516.0 Q1 0
120050 CAI INTERNATIONAL INC 2007 0.000000 Services 12 20070516.0 Q1 0
3896 CALAMP CORP 1999 -0.000000 Manufacturing 23 NaN Q1 NaN
3897 CALAMP CORP 2000 0.000000 Manufacturing 23 NaN Q1 NaN
3898 CALAMP CORP 2001 0.000000 Manufacturing 23 NaN Q1 NaN
3899 CALAMP CORP 2002 0.000000 Manufacturing 23 NaN Q1 NaN
21120 CALATLANTIC GROUP INC 1995 -0.133648 Construction 22 NaN Q1 NaN
21121 CALATLANTIC GROUP INC 1996 -0.133648 Construction 22 NaN Q1 NaN
21122 CALATLANTIC GROUP INC 1997 -0.133648 Construction 22 NaN Q1 NaN
21123 CALATLANTIC GROUP INC 1998 -0.133648 Construction 22 NaN Q1 NaN
21124 CALATLANTIC GROUP INC 1999 -0.133648 Construction 22 NaN Q1 NaN
21125 CALATLANTIC GROUP INC 2000 -0.133648 Construction 22 NaN Q1 NaN
21126 CALATLANTIC GROUP INC 2001 -0.133648 Construction 22 NaN Q1 NaN
21127 CALATLANTIC GROUP INC 2002 -0.133648 Construction 22 NaN Q1 NaN
21128 CALATLANTIC GROUP INC 2003 -0.133648 Construction 22 NaN Q1 NaN
Thanks in advance!!!!
The age column can be generated with:
Code
df.set_index(['conm'], inplace=True)
df['age'] = df.groupby(level=0).apply(
    lambda x: max(x.fyear) - round(x.ipodate.iloc[0] / 10000 - 0.5))
Test Code:
from io import StringIO

import pandas as pd

df = pd.read_fwf(StringIO(
u"""
ID conm fyear ipodate
46078 CAESARS ENTERTAINMENT 2003 19891213.0
46079 CAESARS ENTERTAINMENT 2004 19891213.0
46080 CAESARS ENTERTAINMENT 2005 19891213.0
46091 CAESARS ENTERTAINMENT 2016 19891213.0
114620 CAESARSTONE LTD 2010 20120322.0
114621 CAESARSTONE LTD 2011 20120322.0
114622 CAESARSTONE LTD 2012 20120322.0
114623 CAESARSTONE LTD 2013 20120322.0
114624 CAESARSTONE LTD 2014 20120322.0
114625 CAESARSTONE LTD 2015 20120322.0
114626 CAESARSTONE LTD 2016 20120322.0
132524 CAFEPRESS INC 2010 20120329.0
132525 CAFEPRESS INC 2011 20120329.0
132526 CAFEPRESS INC 2012 20120329.0
132527 CAFEPRESS INC 2013 20120329.0
132528 CAFEPRESS INC 2014 20120329.0
132529 CAFEPRESS INC 2015 20120329.0
132530 CAFEPRESS INC 2016 20120329.0
120049 CAI INTERNATIONAL INC 2005 20070516.0
120050 CAI INTERNATIONAL INC 2006 20070516.0
3897 CALAMP CORP 2000 NaN
3898 CALAMP CORP 2001 NaN
3896 CALAMP CORP 1999 NaN
3899 CALAMP CORP 2002 NaN
21120 CALATLANTIC GROUP INC 1995 NaN
21121 CALATLANTIC GROUP INC 1996 NaN
21122 CALATLANTIC GROUP INC 1997 NaN
21123 CALATLANTIC GROUP INC 1998 NaN
21124 CALATLANTIC GROUP INC 1999 NaN
21125 CALATLANTIC GROUP INC 2000 NaN
21126 CALATLANTIC GROUP INC 2001 NaN
21127 CALATLANTIC GROUP INC 2002 NaN
21128 CALATLANTIC GROUP INC 2003 NaN"""),
    header=1)
df.set_index(['conm'], inplace=True)
df['age'] = df.groupby(level=0).apply(
    lambda x: max(x.fyear) - round(x.ipodate.iloc[0] / 10000 - 0.5))
print(df)
Results:
ID fyear ipodate age
conm
CAESARS ENTERTAINMENT 46078 2003 19891213.0 27.0
CAESARS ENTERTAINMENT 46079 2004 19891213.0 27.0
CAESARS ENTERTAINMENT 46080 2005 19891213.0 27.0
CAESARS ENTERTAINMENT 46091 2016 19891213.0 27.0
CAESARSTONE LTD 114620 2010 20120322.0 4.0
CAESARSTONE LTD 114621 2011 20120322.0 4.0
CAESARSTONE LTD 114622 2012 20120322.0 4.0
CAESARSTONE LTD 114623 2013 20120322.0 4.0
CAESARSTONE LTD 114624 2014 20120322.0 4.0
CAESARSTONE LTD 114625 2015 20120322.0 4.0
CAESARSTONE LTD 114626 2016 20120322.0 4.0
CAFEPRESS INC 132524 2010 20120329.0 4.0
CAFEPRESS INC 132525 2011 20120329.0 4.0
CAFEPRESS INC 132526 2012 20120329.0 4.0
CAFEPRESS INC 132527 2013 20120329.0 4.0
CAFEPRESS INC 132528 2014 20120329.0 4.0
CAFEPRESS INC 132529 2015 20120329.0 4.0
CAFEPRESS INC 132530 2016 20120329.0 4.0
CAI INTERNATIONAL INC 120049 2005 20070516.0 -1.0
CAI INTERNATIONAL INC 120050 2006 20070516.0 -1.0
CALAMP CORP 3897 2000 NaN NaN
CALAMP CORP 3898 2001 NaN NaN
CALAMP CORP 3896 1999 NaN NaN
CALAMP CORP 3899 2002 NaN NaN
CALATLANTIC GROUP INC 21120 1995 NaN NaN
CALATLANTIC GROUP INC 21121 1996 NaN NaN
CALATLANTIC GROUP INC 21122 1997 NaN NaN
CALATLANTIC GROUP INC 21123 1998 NaN NaN
CALATLANTIC GROUP INC 21124 1999 NaN NaN
CALATLANTIC GROUP INC 21125 2000 NaN NaN
CALATLANTIC GROUP INC 21126 2001 NaN NaN
CALATLANTIC GROUP INC 21127 2002 NaN NaN
CALATLANTIC GROUP INC 21128 2003 NaN NaN
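The dv column from the first part of the question isn't covered above. A possible sketch, assuming quartiles of dvpayout are meant within each industry and that each industry has at least four rows (ranking first sidesteps pd.qcut's duplicate-edge errors when many dvpayout values are tied):
df['dv'] = (
    df.groupby('industry')['dvpayout']
      .transform(lambda s: pd.qcut(s.rank(method='first'), 4,
                                   labels=['Q1', 'Q2', 'Q3', 'Q4']))
)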

Transposing Data in Python

I have data from the World Bank that look like this:
Country Name Country Code 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
Aruba ABW 80326 83195 85447 87276 89004 90858 92894 94995 97015 98742 100031 100830 101218 101342 101416 101597 101936 102393 102921 103441 103889
It's population data from 250 some countries and I've just shown the first one for the sake of example. How would I be able to transpose this so that each country and year are on a single row like this?
Country Name Country Code Year Population
Aruba ABW 1995 80326
Aruba ABW 1996 83195
Aruba ABW 1997 85447
Aruba ABW 1998 87276
And so on and so forth
You could use pd.melt.
pd.melt(df, id_vars=['Country Name', 'Country Code'],
        var_name='Year', value_name='Population')
Or alternatively, could add the Country Name and Country Code to the index, stack, then reset the index
df = df.set_index(['Country Name', 'Country Code']).stack().reset_index()
but then you'll have to set the column names afterwards (see the sketch below). pd.melt is probably nicer for this, and is most likely faster as well.
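For completeness, a sketch of that stack route with the renaming step included (reset_index gives the former column level the default name level_2 and the values column the name 0):
df = (df.set_index(['Country Name', 'Country Code'])
        .stack()
        .reset_index()
        .rename(columns={'level_2': 'Year', 0: 'Population'}))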
Demo
>>> pd.melt(df, id_vars=['Country Name', 'Country Code'],
...         var_name='Year', value_name='Population')
Country Name Country Code Year Population
0 Aruba ABW 1995 80326
1 Aruba ABW 1996 83195
2 Aruba ABW 1997 85447
3 Aruba ABW 1998 87276
4 Aruba ABW 1999 89004
5 Aruba ABW 2000 90858
6 Aruba ABW 2001 92894
7 Aruba ABW 2002 94995
8 Aruba ABW 2003 97015
9 Aruba ABW 2004 98742
10 Aruba ABW 2005 100031
11 Aruba ABW 2006 100830
12 Aruba ABW 2007 101218
13 Aruba ABW 2008 101342
14 Aruba ABW 2009 101416
15 Aruba ABW 2010 101597
16 Aruba ABW 2011 101936
17 Aruba ABW 2012 102393
18 Aruba ABW 2013 102921
19 Aruba ABW 2014 103441
20 Aruba ABW 2015 103889
