Transposing Data in Python

I have data from the World Bank that look like this:
Country Name Country Code 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 2014 2015
Aruba ABW 80326 83195 85447 87276 89004 90858 92894 94995 97015 98742 100031 100830 101218 101342 101416 101597 101936 102393 102921 103441 103889
It's population data from some 250 countries, and I've just shown the first one for the sake of example. How would I be able to transpose this so that each country and year are on a single row, like this?
Country Name Country Code Year Population
Aruba ABW 1995 80326
Aruba ABW 1996 83195
Aruba ABW 1997 85447
Aruba ABW 1998 87276
And so on and so forth

You could use pd.melt.
pd.melt(df, id_vars=['Country Name', 'Country Code'],
        var_name='Year', value_name='Population')
Alternatively, you could add Country Name and Country Code to the index, stack, and then reset the index:
df = df.set_index(['Country Name', 'Country Code']).stack().reset_index()
but then you'll have to rename the columns afterwards (see the sketch after the demo). pd.melt is probably nicer for this, and is most likely faster as well.
Demo
>>> pd.melt(df, id_vars=['Country Name', 'Country Code'],
...         var_name='Year', value_name='Population')
Country Name Country Code Year Population
0 Aruba ABW 1995 80326
1 Aruba ABW 1996 83195
2 Aruba ABW 1997 85447
3 Aruba ABW 1998 87276
4 Aruba ABW 1999 89004
5 Aruba ABW 2000 90858
6 Aruba ABW 2001 92894
7 Aruba ABW 2002 94995
8 Aruba ABW 2003 97015
9 Aruba ABW 2004 98742
10 Aruba ABW 2005 100031
11 Aruba ABW 2006 100830
12 Aruba ABW 2007 101218
13 Aruba ABW 2008 101342
14 Aruba ABW 2009 101416
15 Aruba ABW 2010 101597
16 Aruba ABW 2011 101936
17 Aruba ABW 2012 102393
18 Aruba ABW 2013 102921
19 Aruba ABW 2014 103441
20 Aruba ABW 2015 103889
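For completeness, here is what the stack route with the post-hoc rename could look like (a minimal sketch; the column names are chosen to match the melt output):
>>> out = df.set_index(['Country Name', 'Country Code']).stack().reset_index()
>>> out.columns = ['Country Name', 'Country Code', 'Year', 'Population']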

Related

Python pandas Dataframe column to rows manipulation [duplicate]

I'm trying to transpose a few columns while keeping the other columns. I'm having a hard time with pivot or transpose code, as it doesn't give me the output I need.
Can anyone help?
I have this data frame:
EmpID  Goal  week 1  week 2  week 3  week 4
1      556   54      33      24      54
2      342   32      32      56      43
3      534   43      65      64      21
4      244   45      87      5       22
My expected dataframe output is:
EmpID  Goal  Weeks   Actual
1      556   week 1  54
1      556   week 2  33
1      556   week 3  24
1      556   week 4  54
and so on, until all employee IDs are listed.
Something like this:
# Python - melt DF
import pandas as pd
import numpy as np

d = {'Country Code': [1960, 1961, 1962, 1963, 1964, 1965, 1966],
     'ABW': [2.615300, 2.734390, 2.678430, 2.929920, 2.963250, 3.060540, 4.349760],
     'AFG': [0.249760, 0.218480, 0.210840, 0.217240, 0.211410, 0.209910, 0.671330],
     'ALB': [np.nan, np.nan, np.nan, np.nan, np.nan, np.nan, 1.12214]}
df = pd.DataFrame(data=d)
print(df)

df1 = (df.melt(['Country Code'], var_name='Year', value_name='Econometric_Metric')
         .sort_values(['Country Code', 'Year'])
         .reset_index(drop=True))
print(df1)

df2 = (df.set_index(['Country Code'])
         .stack(dropna=False)
         .reset_index(name='Econometric_Metric')
         .rename(columns={'level_1': 'Year'}))
print(df2)
# BEFORE
ABW AFG ALB Country Code
0 2.61530 0.24976 NaN 1960
1 2.73439 0.21848 NaN 1961
2 2.67843 0.21084 NaN 1962
3 2.92992 0.21724 NaN 1963
4 2.96325 0.21141 NaN 1964
5 3.06054 0.20991 NaN 1965
6 4.34976 0.67133 1.12214 1966
# AFTER
Country Code Year Econometric_Metric
0 1960 ABW 2.6153
1 1960 AFG 0.24976
2 1960 ALB NaN
3 1961 ABW 2.73439
4 1961 AFG 0.21848
5 1961 ALB NaN
6 1962 ABW 2.67843
7 1962 AFG 0.21084
8 1962 ALB NaN
9 1963 ABW 2.92992
10 1963 AFG 0.21724
11 1963 ALB NaN
12 1964 ABW 2.96325
13 1964 AFG 0.21141
14 1964 ALB NaN
15 1965 ABW 3.06054
16 1965 AFG 0.20991
17 1965 ALB NaN
18 1966 ABW 4.34976
19 1966 AFG 0.67133
20 1966 ALB 1.12214
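# df2 (via set_index/stack) produces the same output: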
Country Code Year Econometric_Metric
0 1960 ABW 2.6153
1 1960 AFG 0.24976
2 1960 ALB NaN
3 1961 ABW 2.73439
4 1961 AFG 0.21848
5 1961 ALB NaN
6 1962 ABW 2.67843
7 1962 AFG 0.21084
8 1962 ALB NaN
9 1963 ABW 2.92992
10 1963 AFG 0.21724
11 1963 ALB NaN
12 1964 ABW 2.96325
13 1964 AFG 0.21141
14 1964 ALB NaN
15 1965 ABW 3.06054
16 1965 AFG 0.20991
17 1965 ALB NaN
18 1966 ABW 4.34976
19 1966 AFG 0.67133
20 1966 ALB 1.12214
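Applied to the EmpID frame from the question above, the same melt pattern would look roughly like this (a sketch; the Weeks/Actual names come from the expected output):
# Sketch: melt the week columns of the EmpID frame, keeping EmpID and Goal fixed.
import pandas as pd

emp = pd.DataFrame({'EmpID': [1, 2, 3, 4],
                    'Goal': [556, 342, 534, 244],
                    'week 1': [54, 32, 43, 45],
                    'week 2': [33, 32, 65, 87],
                    'week 3': [24, 56, 64, 5],
                    'week 4': [54, 43, 21, 22]})

long_df = (emp.melt(id_vars=['EmpID', 'Goal'], var_name='Weeks', value_name='Actual')
              .sort_values(['EmpID', 'Weeks'])
              .reset_index(drop=True))
print(long_df)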
Also, take a look at the link below for more info.
https://www.dataindependent.com/pandas/pandas-melt/

Applying rolling median across row for pandas dataframe

I would like to apply a rolling median to replace NaN values in the following dataframe, with a window size of 3:
1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 2001 2002 2003 2004 ... 2007 2008 2009 2010 2011 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
17 366000.0 278000.0 330000.0 NaN 434000.0 470600.0 433000.0 456000.0 556300.0 580200.0 635300.0 690600.0 800000.0 NaN 821500.0 ... 850800.0 905000.0 947500.0 1016500.0 1043900.0 1112800.0 1281900.0 1312700.0 1422000.0 1526900.0 1580000.0 1599000.0 1580000.0 NaN NaN
However, pandas' rolling function seems to work on columns and not along a row. How can I fix this? Also, the solution should NOT change any of the non-NaN values in that row.
First compute the rolling medians by using rolling() with axis=1 (row-wise), min_periods=0 (to handle NaN), and closed='both' (otherwise the left edge gets excluded).
Then replace only the NaN entries with these medians by using fillna().
medians = df.rolling(3, min_periods=0, closed='both', axis=1).median()
df = df.fillna(medians)
# 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ... 2012 2013 2014 2015 2016 2017 2018 2019 2020 2021
# 17 366000.0 278000.0 330000.0 330000.0 434000.0 470600.0 433000.0 456000.0 556300.0 580200.0 ... 1112800.0 1281900.0 1312700.0 1422000.0 1526900.0 1580000.0 1599000.0 1580000.0 1580000.0 1589500.0
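A possible alternative, in case axis=1 is unavailable in your pandas version (a sketch with the same window settings):
# The axis argument of rolling() has been deprecated in recent pandas releases;
# the same row-wise median can be computed by transposing, rolling column-wise,
# and transposing back.
medians = df.T.rolling(3, min_periods=0, closed='both').median().T
df = df.fillna(medians)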

Pandas DataFrame: Dropping rows after meeting conditions in columns

I have a large panel data in a pandas DataFrame:
import pandas as pd
df = pd.read_csv('Qs_example_data.csv')
df.head()
ID Year DOB status YOD
223725 1991 1975.0 No 2021
223725 1992 1975.0 No 2021
223725 1993 1975.0 No 2021
223725 1994 1975.0 No 2021
223725 1995 1975.0 No 2021
I want to drop the rows based on the following condition:
If the value in YOD matches the value in Year, or a Yes is observed in the status column for that ID, then all rows after that matching row for that ID are dropped.
For example, in the DataFrame, ID 68084329 has the value 2012 in both the Year and YOD columns on row 221930. All rows after 221930 for 68084329 should be dropped.
df.loc[df['ID'] == 68084329]
ID Year DOB status YOD
221910 68084329 1991 1942.0 No 2012
221911 68084329 1992 1942.0 No 2012
221912 68084329 1993 1942.0 No 2012
221913 68084329 1994 1942.0 No 2012
221914 68084329 1995 1942.0 No 2012
221915 68084329 1996 1942.0 No 2012
221916 68084329 1997 1942.0 No 2012
221917 68084329 1998 1942.0 No 2012
221918 68084329 1999 1942.0 No 2012
221919 68084329 2000 1942.0 No 2012
221920 68084329 2001 1942.0 No 2012
221921 68084329 2002 1942.0 No 2012
221922 68084329 2003 1942.0 No 2012
221923 68084329 2004 1942.0 No 2012
221924 68084329 2005 1942.0 No 2012
221925 68084329 2006 1942.0 No 2012
221926 68084329 2007 1942.0 No 2012
221927 68084329 2008 1942.0 No 2012
221928 68084329 2010 1942.0 No 2012
221929 68084329 2011 1942.0 No 2012
221930 68084329 2012 1942.0 Yes 2012
221931 68084329 2013 1942.0 No 2012
221932 68084329 2014 1942.0 No 2012
221933 68084329 2015 1942.0 No 2012
221934 68084329 2016 1942.0 No 2012
221935 68084329 2017 1942.0 No 2012
I have a lot of IDs that have rows which need to be dropped in accordance with the above condition. How do I do this?
The following code should also work:
result = df[0:0]
ids = []
for i in df.ID:
    if i not in ids:
        ids.append(i)
for k in ids:
    temp = df[df.ID == k]
    for j in range(len(temp)):
        result = pd.concat([result, temp.iloc[j:j+1, :]])
        if temp.iloc[j, :]['status'] == 'Yes':
            break
print(result)
This should do it. From your wording, it wasn't clear whether you need to "drop all the rows after you encounter a Yes for that ID" or "just the rows where you encounter a Yes". I assumed the former.
import pandas as pd
def __get_nos__(df):
    return df.iloc[0:(df['Status'] != 'Yes').values.argmin(), :]
df = pd.DataFrame()
df['ID'] = [12345678]*10 + [13579]*10
df['Year'] = list(range(2000, 2010))*2
df['DOB'] = list(range(2000, 2010))*2
df['YOD'] = list(range(2000, 2010))*2
df['Status'] = ['No']*5 + ['Yes']*5 + ['No']*7 + ['Yes']*3
""" df
ID Year DOB YOD Status
0 12345678 2000 2000 2000 No
1 12345678 2001 2001 2001 No
2 12345678 2002 2002 2002 No
3 12345678 2003 2003 2003 No
4 12345678 2004 2004 2004 No
5 12345678 2005 2005 2005 Yes
6 12345678 2006 2006 2006 Yes
7 12345678 2007 2007 2007 Yes
8 12345678 2008 2008 2008 Yes
9 12345678 2009 2009 2009 Yes
10 13579 2000 2000 2000 No
11 13579 2001 2001 2001 No
12 13579 2002 2002 2002 No
13 13579 2003 2003 2003 No
14 13579 2004 2004 2004 No
15 13579 2005 2005 2005 No
16 13579 2006 2006 2006 No
17 13579 2007 2007 2007 Yes
18 13579 2008 2008 2008 Yes
19 13579 2009 2009 2009 Yes
"""
df.groupby('ID').apply(lambda x: __get_nos__(x)).reset_index(drop=True)
""" Output
ID Year DOB YOD Status
0 13579 2000 2000 2000 No
1 13579 2001 2001 2001 No
2 13579 2002 2002 2002 No
3 13579 2003 2003 2003 No
4 13579 2004 2004 2004 No
5 13579 2005 2005 2005 No
6 13579 2006 2006 2006 No
7 12345678 2000 2000 2000 No
8 12345678 2001 2001 2001 No
9 12345678 2002 2002 2002 No
10 12345678 2003 2003 2003 No
11 12345678 2004 2004 2004 No
"""

How to convert multiple rows into stacked columns in Excel/Python?

Is there an easy way to convert from Type A to Type B?
Note: Kutools (an Excel plugin) provides a solution for it, but it is not robust and does not seem scalable.
Any workaround for this?
Assuming you can make the df look like the one below (just remove the top row which says Type A):
GDP per capita 1950 1951 1952 1953
0 Antigua and Barbuda 3544 3633 3723 3817
1 Argentina 7540 7612 7019 7198
2 Armenia 1862 1834 1914 1958
3 Aruba 3897 3994 4094 4196
4 Australia 12073 12229 12084 12228
5 Austria 6919 7382 7386 7692
Using pd.melt()
>>> pd.melt(df, id_vars='GDP per capita', var_name='Year', value_name='GDP Value')
GDP per capita Year GDP Value
0 Antigua and Barbuda 1950 3544
1 Argentina 1950 7540
2 Armenia 1950 1862
3 Aruba 1950 3897
4 Australia 1950 12073
5 Austria 1950 6919
6 Antigua and Barbuda 1951 3633
7 Argentina 1951 7612
8 Armenia 1951 1834
9 Aruba 1951 3994
10 Australia 1951 12229
11 Austria 1951 7382
12 Antigua and Barbuda 1952 3723
13 Argentina 1952 7019
14 Armenia 1952 1914
15 Aruba 1952 4094
16 Australia 1952 12084
17 Austria 1952 7386
18 Antigua and Barbuda 1953 3817
19 Argentina 1953 7198
20 Armenia 1953 1958
21 Aruba 1953 4196
22 Australia 1953 12228
23 Austria 1953 7692
To get exactly the look of the image you posted, use:
df1 = pd.melt(df, id_vars='GDP per capita', var_name='Year', value_name='GDP Value')
df1.rename(columns={'GDP per capita': 'Country'}, inplace=True)
df1['GDP'] = 'GDP per capita'
df1 = df1[['GDP', 'Country', 'Year', 'GDP Value']]
df1.to_csv('filepath+filename.csv', index=False)

Transpose and widen Data

My pandas data frame looks as follows:
Country Code 1960 1961 1962 1963 1964 1965 1966 1967 1968 ... 2015
ABW 2.615300 2.734390 2.678430 2.929920 2.963250 3.060540 ... 4.349760
AFG 0.249760 0.218480 0.210840 0.217240 0.211410 0.209910 ... 0.671330
ALB NaN NaN NaN NaN NaN NaN NaN NaN NaN ... 1.12214
...
How can I transpose it so that it looks as follows?
Country_Code Year Econometric_Metric
ABW 1960 2.615300
ABW 1961 2.734390
ABW 1962 2.678430
...
ABW 2015 4.349760
AFG 1960 0.249760
AFG 1961 0.218480
AFG 1962 0.210840
...
AFG 2015 0.671330
ALB 1960 NaN
ALB 1961 NaN
ALB 1962 NaN
ALB 2015 1.12214
...
Thanks.
I think you need melt with sort_values:
df = (df.melt(['Country Code'], var_name='Year', value_name='Econometric_Metric')
        .sort_values(['Country Code', 'Year'])
        .reset_index(drop=True))
Or set_index with stack:
df = (df.set_index(['Country Code'])
        .stack(dropna=False)
        .reset_index(name='Econometric_Metric')
        .rename(columns={'level_1': 'Year'}))
print(df.head(10))
Country Code Year Econometric_Metric
0 ABW 1960 2.61530
1 ABW 1961 2.73439
2 ABW 1962 2.67843
3 ABW 1963 2.92992
4 ABW 1964 2.96325
5 ABW 1965 3.06054
6 ABW 1966 NaN
7 ABW 1967 NaN
8 ABW 1968 NaN
9 ABW 2015 4.34976
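If the original year column labels were read in as strings, the Year values in the melted frame will be strings too; an optional follow-up cast (assuming that is the case):
# Cast the melted Year column to integers (only needed if the labels were strings).
df['Year'] = df['Year'].astype(int)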
