Pandas Melt with Multiple Value Vars - python
I have a data set in wide format like this:
Index Country Variable 2000 2001 2002 2003 2004 2005
0 Argentina var1 12 15 18 17 23 29
1 Argentina var2 1 3 2 5 7 5
2 Brazil var1 20 23 25 29 31 32
3 Brazil var2 0 1 2 2 3 3
I want to reshape my data to long format so that year, var1, and var2 become columns:
Index Country year var1 var2
0 Argentina 2000 12 1
1 Argentina 2001 15 3
2 Argentina 2002 18 2
....
6 Brazil 2000 20 0
7 Brazil 2001 23 1
I got my code to work when I had only one variable by writing:
df=(pd.melt(df,id_vars='Country',value_name='Var1', var_name='year'))
I can't figure out how to do this for several variables (var1, var2, var3, etc.).
Instead of melt, you can use a combination of stack and unstack:
(df.set_index(['Country', 'Variable'])
.rename_axis(['Year'], axis=1)
.stack()
.unstack('Variable')
.reset_index())
Variable Country Year var1 var2
0 Argentina 2000 12 1
1 Argentina 2001 15 3
2 Argentina 2002 18 2
3 Argentina 2003 17 5
4 Argentina 2004 23 7
5 Argentina 2005 29 5
6 Brazil 2000 20 0
7 Brazil 2001 23 1
8 Brazil 2002 25 2
9 Brazil 2003 29 2
10 Brazil 2004 31 3
11 Brazil 2005 32 3
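To try the chain above without the original file, here is a minimal self-contained sketch. It rebuilds a three-year subset of the question's table by hand, and it assumes the year labels are strings (the question does not say whether they are strings or integers). The last line, which clears the leftover "Variable" label on the columns, is an optional extra and not part of the answer.

```python
import pandas as pd

# Subset of the question's sample data; year labels assumed to be strings.
df = pd.DataFrame({
    "Country": ["Argentina", "Argentina", "Brazil", "Brazil"],
    "Variable": ["var1", "var2", "var1", "var2"],
    "2000": [12, 1, 20, 0],
    "2001": [15, 3, 23, 1],
    "2002": [18, 2, 25, 2],
})

long_df = (
    df.set_index(["Country", "Variable"])
      .rename_axis("Year", axis=1)   # name the column axis so the stacked level is called Year
      .stack()                        # years move from the columns into the row index
      .unstack("Variable")            # var1/var2 move from the row index into the columns
      .reset_index()
)
long_df.columns.name = None  # optional: drop the "Variable" label shown above the index
print(long_df)
```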
Option 1
Using melt then unstack for var1, var2, etc...
(df.melt(id_vars=['Country','Variable'],var_name='Year')
.set_index(['Country','Year','Variable'])
.squeeze()
.unstack()
.reset_index())
Output:
Variable Country Year var1 var2
0 Argentina 2000 12 1
1 Argentina 2001 15 3
2 Argentina 2002 18 2
3 Argentina 2003 17 5
4 Argentina 2004 23 7
5 Argentina 2005 29 5
6 Brazil 2000 20 0
7 Brazil 2001 23 1
8 Brazil 2002 25 2
9 Brazil 2003 29 2
10 Brazil 2004 31 3
11 Brazil 2005 32 3
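Since the question asks about melt specifically, a close variant of Option 1 may be worth sketching: melt first, then pivot the Variable column back out. This is not from the answer itself. It assumes pandas 1.1 or later (where pivot accepts a list for index), and `value` is just an illustrative name for the melted column.

```python
import pandas as pd

# Subset of the question's sample data; year labels assumed to be strings.
df = pd.DataFrame({
    "Country": ["Argentina", "Argentina", "Brazil", "Brazil"],
    "Variable": ["var1", "var2", "var1", "var2"],
    "2000": [12, 1, 20, 0],
    "2001": [15, 3, 23, 1],
    "2002": [18, 2, 25, 2],
})

# Step 1: one row per (Country, Variable, Year).
melted = df.melt(id_vars=["Country", "Variable"], var_name="Year", value_name="value")

# Step 2: spread Variable back into columns, keeping Country and Year as keys.
wide = (
    melted.pivot(index=["Country", "Year"], columns="Variable", values="value")
          .reset_index()
)
wide.columns.name = None  # optional: remove the leftover "Variable" label
print(wide)
```

pivot raises an error if a (Country, Year, Variable) combination appears more than once; in that case pivot_table with an explicit aggfunc would be the usual fallback.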
Option 2
Using pivot then stack:
(df.pivot(index='Country',columns='Variable')
.stack(0)
.rename_axis(['Country','Year'])
.reset_index())
Output:
Variable Country Year var1 var2
0 Argentina 2000 12 1
1 Argentina 2001 15 3
2 Argentina 2002 18 2
3 Argentina 2003 17 5
4 Argentina 2004 23 7
5 Argentina 2005 29 5
6 Brazil 2000 20 0
7 Brazil 2001 23 1
8 Brazil 2002 25 2
9 Brazil 2003 29 2
10 Brazil 2004 31 3
11 Brazil 2005 32 3
Option 3 (ayhan's solution)
Using set_index, stack, and unstack:
(df.set_index(['Country', 'Variable'])
.rename_axis(['Year'], axis=1)
.stack()
.unstack('Variable')
.reset_index())
Output:
Variable Country Year var1 var2
0 Argentina 2000 12 1
1 Argentina 2001 15 3
2 Argentina 2002 18 2
3 Argentina 2003 17 5
4 Argentina 2004 23 7
5 Argentina 2005 29 5
6 Brazil 2000 20 0
7 Brazil 2001 23 1
8 Brazil 2002 25 2
9 Brazil 2003 29 2
10 Brazil 2004 31 3
11 Brazil 2005 32 3
numpy
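import numpy as np
import pandas as pd

years = df.drop(columns=['Country', 'Variable'])
y = years.values
m = y.shape[1]  # number of year columns
# integer codes and unique labels for countries and variables
f0, u0 = pd.factorize(df.Country.values)
f1, u1 = pd.factorize(df.Variable.values)
# w[variable, country, year]; assumes every (country, variable) pair is present
w = np.empty((u1.size, u0.size, m), dtype=y.dtype)
w[f1, f0] = y
results = pd.DataFrame(dict(
    Country=u0.repeat(m),
    Year=np.tile(years.columns.values, u0.size),
# one row per variable, then transpose; also correct when countries != variables
)).join(pd.DataFrame(w.reshape(u1.size, -1).T, columns=u1))
results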
Country Year var1 var2
0 Argentina 2000 12 1
1 Argentina 2001 15 3
2 Argentina 2002 18 2
3 Argentina 2003 17 5
4 Argentina 2004 23 7
5 Argentina 2005 29 5
6 Brazil 2000 20 0
7 Brazil 2001 23 1
8 Brazil 2002 25 2
9 Brazil 2003 29 2
10 Brazil 2004 31 3
11 Brazil 2005 32 3
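The factorize-and-fill idea can be wrapped in a function and checked on the question's data. The sketch below is my own packaging of the approach (the function name `wide_to_long_numpy` is hypothetical). It reshapes with `w.reshape(u1.size, -1)`, so the result keeps the right shape even when the number of countries differs from the number of variables. It assumes every (country, variable) pair occurs exactly once; missing pairs would leave uninitialised values from np.empty.

```python
import numpy as np
import pandas as pd

# Subset of the question's sample data; year labels assumed to be strings.
df = pd.DataFrame({
    "Country": ["Argentina", "Argentina", "Brazil", "Brazil"],
    "Variable": ["var1", "var2", "var1", "var2"],
    "2000": [12, 1, 20, 0],
    "2001": [15, 3, 23, 1],
    "2002": [18, 2, 25, 2],
})

def wide_to_long_numpy(frame):
    """Hypothetical helper: wide -> long via factorize and a 3-D array."""
    years = frame.drop(columns=["Country", "Variable"])
    y = years.to_numpy()
    m = y.shape[1]                                   # number of year columns
    f0, u0 = pd.factorize(frame["Country"].to_numpy())
    f1, u1 = pd.factorize(frame["Variable"].to_numpy())
    w = np.empty((u1.size, u0.size, m), dtype=y.dtype)
    w[f1, f0] = y                                    # w[variable, country, year]
    values = w.reshape(u1.size, -1).T                # rows: (country, year); columns: variables
    keys = pd.DataFrame({
        "Country": u0.repeat(m),
        "Year": np.tile(years.columns.to_numpy(), u0.size),
    })
    return keys.join(pd.DataFrame(values, columns=u1))

both = wide_to_long_numpy(df)
# One country, two variables: a case where the two counts differ.
one_country = wide_to_long_numpy(df[df["Country"] == "Argentina"])
print(both)
```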
Related
How to create multiple triangles based on given number of simulations?
Below is my code:
triangle = cl.load_sample('genins')
# Use bootstrap sampler to get resampled triangles
bootstrapdataframe = cl.BootstrapODPSample(n_sims=4, random_state=42).fit(triangle).resampled_triangles_
# converting to dataframe
resampledtriangledf = bootstrapdataframe.to_frame()
print(resampledtriangledf)
In the code above I set n_sims (the number of simulations) to 4, so it generates the dataframe below:
0 2001 12 254,926
0 2001 24 535,877
0 2001 36 1,355,613
0 2001 48 2,034,557
0 2001 60 2,311,789
0 2001 72 2,539,807
0 2001 84 2,724,773
0 2001 96 3,187,095
0 2001 108 3,498,646
0 2001 120 3,586,037
0 2002 12 542,369
0 2002 24 1,016,927
0 2002 36 2,201,329
0 2002 48 2,923,381
0 2002 60 3,711,305
0 2002 72 3,914,829
0 2002 84 4,385,757
0 2002 96 4,596,072
0 2002 108 5,047,861
0 2003 12 235,361
0 2003 24 960,355
0 2003 36 1,661,972
0 2003 48 2,643,370
0 2003 60 3,372,684
0 2003 72 3,642,605
0 2003 84 4,160,583
0 2003 96 4,480,332
0 2004 12 764,553
0 2004 24 1,703,557
0 2004 36 2,498,418
0 2004 48 3,198,358
0 2004 60 3,524,562
0 2004 72 3,884,971
0 2004 84 4,268,241
0 2005 12 381,670
0 2005 24 1,124,054
0 2005 36 2,026,434
0 2005 48 2,863,902
0 2005 60 3,039,322
0 2005 72 3,288,253
0 2006 12 320,332
0 2006 24 1,022,323
0 2006 36 1,830,842
0 2006 48 2,676,710
0 2006 60 3,375,172
0 2007 12 330,361
0 2007 24 1,463,348
0 2007 36 2,771,839
0 2007 48 4,003,745
0 2008 12 282,143
0 2008 24 1,782,267
0 2008 36 2,898,699
0 2009 12 362,726
0 2009 24 1,277,750
0 2010 12 321,247
1 2001 12 219,021
1 2001 24 755,975
1 2001 36 1,360,298
1 2001 48 2,062,947
1 2001 60 2,356,983
1 2001 72 2,781,187
1 2001 84 2,987,837
1 2001 96 3,118,952
1 2001 108 3,307,522
1 2001 120 3,455,107
1 2002 12 302,932
1 2002 24 1,022,459
1 2002 36 1,634,938
1 2002 48 2,538,708
1 2002 60 3,005,695
1 2002 72 3,274,719
1 2002 84 3,356,499
1 2002 96 3,595,361
1 2002 108 4,100,065
1 2003 12 489,934
1 2003 24 1,233,438
1 2003 36 2,471,849
1 2003 48 3,672,629
1 2003 60 4,157,489
1 2003 72 4,498,470
1 2003 84 4,587,579
1 2003 96 4,816,232
1 2004 12 518,680
1 2004 24 1,209,705
1 2004 36 2,019,757
1 2004 48 2,997,820
1 2004 60 3,630,442
1 2004 72 3,881,093
1 2004 84 4,080,322
1 2005 12 453,963
1 2005 24 1,458,504
1 2005 36 2,036,506
1 2005 48 2,846,464
1 2005 60 3,280,124
1 2005 72 3,544,597
1 2006 12 369,755
1 2006 24 1,209,117
1 2006 36 1,973,136
1 2006 48 3,034,294
1 2006 60 3,537,784
1 2007 12 477,788
1 2007 24 1,524,537
1 2007 36 2,170,391
1 2007 48 3,355,093
1 2008 12 250,690
1 2008 24 1,546,986
1 2008 36 2,996,737
1 2009 12 271,270
1 2009 24 1,446,353
1 2010 12 510,114
2 2001 12 170,866
2 2001 24 797,338
2 2001 36 1,663,610
2 2001 48 2,293,697
2 2001 60 2,607,067
2 2001 72 2,979,479
2 2001 84 3,127,308
2 2001 96 3,285,338
2 2001 108 3,574,272
2 2001 120 3,630,610
2 2002 12 259,060
2 2002 24 1,011,092
2 2002 36 1,851,504
2 2002 48 2,705,313
2 2002 60 3,195,774
2 2002 72 3,766,008
2 2002 84 3,944,417
2 2002 96 4,234,043
2 2002 108 4,763,664
2 2003 12 239,981
2 2003 24 983,484
2 2003 36 1,929,785
2 2003 48 2,497,929
2 2003 60 2,972,887
2 2003 72 3,313,868
2 2003 84 3,727,432
2 2003 96 4,024,122
2 2004 12 77,522
2 2004 24 729,401
2 2004 36 1,473,914
2 2004 48 2,376,313
2 2004 60 2,999,197
2 2004 72 3,372,020
2 2004 84 3,887,883
2 2005 12 321,598
2 2005 24 1,132,502
2 2005 36 1,710,504
2 2005 48 2,438,620
2 2005 60 2,801,957
2 2005 72 3,182,466
2 2006 12 255,407
2 2006 24 1,275,141
2 2006 36 2,083,421
2 2006 48 3,144,579
2 2006 60 3,891,772
2 2007 12 338,120
2 2007 24 1,275,697
2 2007 36 2,238,715
2 2007 48 3,615,323
2 2008 12 310,214
2 2008 24 1,237,156
2 2008 36 2,563,326
2 2009 12 271,093
2 2009 24 1,523,131
2 2010 12 430,591
3 2001 12 330,887
3 2001 24 831,193
3 2001 36 1,601,374
3 2001 48 2,188,879
3 2001 60 2,662,773
3 2001 72 3,086,976
3 2001 84 3,332,247
3 2001 96 3,317,279
3 2001 108 3,576,659
3 2001 120 3,613,563
3 2002 12 358,263
3 2002 24 1,139,259
3 2002 36 2,236,375
3 2002 48 3,163,464
3 2002 60 3,715,130
3 2002 72 4,295,638
3 2002 84 4,502,105
3 2002 96 4,769,139
3 2002 108 5,323,304
3 2003 12 489,934
3 2003 24 1,570,352
3 2003 36 3,123,215
3 2003 48 4,189,299
3 2003 60 4,819,070
3 2003 72 5,306,689
3 2003 84 5,560,371
3 2003 96 5,827,003
3 2004 12 419,727
3 2004 24 1,308,884
3 2004 36 2,118,936
3 2004 48 2,906,732
3 2004 60 3,561,577
3 2004 72 3,934,400
3 2004 84 4,010,511
3 2005 12 389,217
3 2005 24 1,173,226
3 2005 36 1,794,216
3 2005 48 2,528,910
3 2005 60 3,474,035
3 2005 72 3,908,999
3 2006 12 291,940
3 2006 24 1,136,674
3 2006 36 1,915,614
3 2006 48 2,693,930
3 2006 60 3,375,601
3 2007 12 506,055
3 2007 24 1,684,660
3 2007 36 2,678,739
3 2007 48 3,545,156
3 2008 12 282,143
3 2008 24 1,536,490
3 2008 36 2,458,789
3 2009 12 271,093
3 2009 24 1,199,897
3 2010 12 266,359
Using the above dataframe, I have to create 4 triangles based on the Total column. For example:
Row Labels            12           24           36           48           60           72           84           96          108          120  Grand Total
2001             254,926      535,877    1,355,613    2,034,557    2,311,789    2,539,807    2,724,773    3,187,095    3,498,646    3,586,037   22,029,119
2002             542,369    1,016,927    2,201,329    2,923,381    3,711,305    3,914,829    4,385,757    4,596,072    5,047,861                28,339,832
2003             235,361      960,355    1,661,972    2,643,370    3,372,684    3,642,605    4,160,583    4,480,332                             21,157,261
2004             764,553    1,703,557    2,498,418    3,198,358    3,524,562    3,884,971    4,268,241                                          19,842,659
2005             381,670    1,124,054    2,026,434    2,863,902    3,039,322    3,288,253                                                       12,723,635
2006             320,332    1,022,323    1,830,842    2,676,710    3,375,172                                                                     9,225,377
2007             330,361    1,463,348    2,771,839    4,003,745                                                                                  8,569,294
2008             282,143    1,782,267    2,898,699                                                                                               4,963,110
2009             362,726    1,277,750                                                                                                            1,640,475
2010             321,247                                                                                                                           321,247
Grand Total    3,795,687   10,886,456   17,245,147   20,344,022   19,334,833   17,270,466   15,539,355   12,263,499    8,546,507    3,586,037  128,812,009
. . .
Like this, I need 4 triangles (4 is the number of simulations) built from the first dataframe. If the user gives n_sims=900, it produces 900 sets of total values, and from these we have to create 900 triangles. In the triangle above I displayed only the one for simulation 0, but I need the triangles for 1, 2 and 3 as well.
Try:
df['sample_size'] = pd.to_numeric(df['sample_size'].str.replace(',',''))
df.pivot_table('sample_size','year', 'no', aggfunc='first')\
  .pipe(lambda x: pd.concat([x,x.sum().to_frame('Grand Total').T]))
Output:
no                   12          24          36          48          60          72          84          96         108         120
2001           254926.0    535877.0   1355613.0   2034557.0   2311789.0   2539807.0   2724773.0   3187095.0   3498646.0   3586037.0
2002           542369.0   1016927.0   2201329.0   2923381.0   3711305.0   3914829.0   4385757.0   4596072.0   5047861.0         NaN
2003           235361.0    960355.0   1661972.0   2643370.0   3372684.0   3642605.0   4160583.0   4480332.0         NaN         NaN
2004           764553.0   1703557.0   2498418.0   3198358.0   3524562.0   3884971.0   4268241.0         NaN         NaN         NaN
2005           381670.0   1124054.0   2026434.0   2863902.0   3039322.0   3288253.0         NaN         NaN         NaN         NaN
2006           320332.0   1022323.0   1830842.0   2676710.0   3375172.0         NaN         NaN         NaN         NaN         NaN
2007           330361.0   1463348.0   2771839.0   4003745.0         NaN         NaN         NaN         NaN         NaN         NaN
2008           282143.0   1782267.0   2898699.0         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2009           362726.0   1277750.0         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
2010           321247.0         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN         NaN
Grand Total   3795688.0  10886458.0  17245146.0  20344023.0  19334834.0  17270465.0  15539354.0  12263499.0   8546507.0   3586037.0
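The pivot in this answer produces a single triangle, while the question asks for one per simulation. A minimal sketch of one way to extend it is to group by the simulation index and apply the same pivot to each group. The column name `sim` is an assumption (the question's printout shows the simulation number but no header), and only a handful of rows from the question's data are used here.

```python
import pandas as pd

# A few rows from the question's output; "sim" is a hypothetical column name.
df = pd.DataFrame({
    "sim": [0, 0, 0, 1, 1, 1],
    "year": [2001, 2001, 2002, 2001, 2001, 2002],
    "no": [12, 24, 12, 12, 24, 12],
    "sample_size": ["254,926", "535,877", "542,369", "219,021", "755,975", "302,932"],
})
df["sample_size"] = pd.to_numeric(df["sample_size"].str.replace(",", "", regex=False))

def triangle(group):
    """Pivot one simulation into a year x development-period triangle with a total row."""
    tri = group.pivot_table(values="sample_size", index="year", columns="no", aggfunc="first")
    return pd.concat([tri, tri.sum().to_frame("Grand Total").T])

# One triangle per simulation, keyed by the simulation number.
triangles = {sim: triangle(group) for sim, group in df.groupby("sim")}
print(triangles[1])
```

With n_sims=900 the dictionary would simply hold 900 entries; nothing in the loop depends on the number of simulations.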
Convert a pandas DataFrame with different 'year' columns into a continuous time series
Is it possible to convert a pandas DataFrame with one column per year into a single time series in which each year follows the previous one?
This is likely what df.unstack(level=1) is meant for.
np.random.seed(111)  # reproducibility
df = pd.DataFrame(
    data={
        "2009": np.random.randn(12),
        "2010": np.random.randn(12),
        "2011": np.random.randn(12),
    },
    index=range(1, 13)
)
print(df)
Out[45]:
        2009      2010      2011
1  -1.133838 -1.440585  0.570594
2   0.384319  0.773703  0.915420
3   1.496554 -1.027967 -1.669341
4  -0.355382 -0.090986  0.482714
5  -0.787534  0.492003 -0.310473
6  -0.459439  0.424672  2.394690
7  -0.059169  1.283049  1.550931
8  -0.354174  0.315986 -0.646465
9  -0.735523 -0.408082 -0.928937
10 -1.183940 -0.067948 -1.654976
11  0.238894 -0.952427  0.350193
12 -0.589920 -0.110677 -0.141757
df_out = df.unstack(1).reset_index()
df_out.columns = ["year", "month", "value"]
print(df_out)
Out[46]:
    year  month     value
0   2009      1 -1.133838
1   2009      2  0.384319
2   2009      3  1.496554
3   2009      4 -0.355382
4   2009      5 -0.787534
5   2009      6 -0.459439
6   2009      7 -0.059169
7   2009      8 -0.354174
8   2009      9 -0.735523
9   2009     10 -1.183940
10  2009     11  0.238894
11  2009     12 -0.589920
12  2010      1 -1.440585
13  2010      2  0.773703
14  2010      3 -1.027967
15  2010      4 -0.090986
16  2010      5  0.492003
17  2010      6  0.424672
18  2010      7  1.283049
19  2010      8  0.315986
20  2010      9 -0.408082
21  2010     10 -0.067948
22  2010     11 -0.952427
23  2010     12 -0.110677
24  2011      1  0.570594
25  2011      2  0.915420
26  2011      3 -1.669341
27  2011      4  0.482714
28  2011      5 -0.310473
29  2011      6  2.394690
30  2011      7  1.550931
31  2011      8 -0.646465
32  2011      9 -0.928937
33  2011     10 -1.654976
34  2011     11  0.350193
35  2011     12 -0.141757
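If the goal is a series with real dates on the index rather than separate year and month columns, the (year, month) pairs produced by unstack can be turned into timestamps. This is a sketch under the assumption that the row labels 1 to 12 are calendar months and that the first day of each month is an acceptable date; the random values are placeholders.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # placeholder data, seeded for reproducibility
df = pd.DataFrame(
    {"2009": rng.standard_normal(12), "2010": rng.standard_normal(12)},
    index=range(1, 13),  # assumed to be months 1..12
)

# Series indexed by (year, month), with the years laid end to end in column order.
s = df.unstack()

# Replace the (year, month) MultiIndex with one timestamp per month.
s.index = pd.to_datetime([f"{year}-{int(month):02d}-01" for year, month in s.index])
print(s.head())
```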
Creating an array based on DataFrame values
If I have an id with start date = 1988, end date = 2018 and value = 21100, I want to create an array or dataframe of the dates from 1988 to 2018 (1988, 1989, 1990, ..., 2018), with each date set to the same value of 21100. So I basically want something that looks like:
date, id1, id2
1988, 21100, 0
1989, 21100, 0
1990, 21100, 0
...
1994, 21100, 4598
...
2013, 21100, 4598
...
2018, 21100, 0
How could I do this? I want the array to start populating the value at the start date and to stop at the end date. I have multiple ids (268), and I want to loop through each one, adding a new column (id2, id3, ..., id268). For example, id2 runs from 1994 to 2013 with a value of 4598.
EDIT:
example = pd.DataFrame({
    'id': ['id1', 'id2', 'id3', 'id4'],
    'start date': ['1988', '1988', '2000', '2005'],
    'end date': ['2018', '2013', '2005', '2017'],
    'value': [2100, 4568, 7896, 68909]
})
print (example)
    id start date end date  value
0  id1       1988     2018   2100
1  id2       1988     2013   4568
2  id3       2000     2005   7896
3  id4       2005     2017  68909
You can create one Series per id in a list comprehension, join them together with concat, replace missing values, convert to integers, and finally convert the index to a column named date:
L = [pd.Series(v, index=range(int(s), int(e)+1))
     for s,e,v in zip(example['start date'], example['end date'], example['value'])]
df1 = (pd.concat(L, axis=1, keys=example['id'])
         .fillna(0)
         .astype(int)
         .rename_axis('date')
         .reset_index())
print (df1)
id  date   id1   id2   id3    id4
0   1988  2100  4568     0      0
1   1989  2100  4568     0      0
2   1990  2100  4568     0      0
3   1991  2100  4568     0      0
4   1992  2100  4568     0      0
5   1993  2100  4568     0      0
6   1994  2100  4568     0      0
7   1995  2100  4568     0      0
8   1996  2100  4568     0      0
9   1997  2100  4568     0      0
10  1998  2100  4568     0      0
11  1999  2100  4568     0      0
12  2000  2100  4568  7896      0
13  2001  2100  4568  7896      0
14  2002  2100  4568  7896      0
15  2003  2100  4568  7896      0
16  2004  2100  4568  7896      0
17  2005  2100  4568  7896  68909
18  2006  2100  4568     0  68909
19  2007  2100  4568     0  68909
20  2008  2100  4568     0  68909
21  2009  2100  4568     0  68909
22  2010  2100  4568     0  68909
23  2011  2100  4568     0  68909
24  2012  2100  4568     0  68909
25  2013  2100  4568     0  68909
26  2014  2100     0     0  68909
27  2015  2100     0     0  68909
28  2016  2100     0     0  68909
29  2017  2100     0     0  68909
30  2018  2100     0     0      0
Use DataFrame constructor with range:
start = 1988
end = 2019
val = 21100
df = pd.DataFrame({'date':range(start, end), 'id1': val})
print (df)
    date    id1
0   1988  21100
1   1989  21100
2   1990  21100
3   1991  21100
4   1992  21100
5   1993  21100
6   1994  21100
7   1995  21100
8   1996  21100
9   1997  21100
10  1998  21100
11  1999  21100
12  2000  21100
13  2001  21100
14  2002  21100
15  2003  21100
16  2004  21100
17  2005  21100
18  2006  21100
19  2007  21100
20  2008  21100
21  2009  21100
22  2010  21100
23  2011  21100
24  2012  21100
25  2013  21100
26  2014  21100
27  2015  21100
28  2016  21100
29  2017  21100
30  2018  21100
How to replace values in a column of panel data with values from a list in Python?
I have a database in panel data form:
Date  id  variable1  variable2
2015   1         10        200
2016   1         17        300
2017   1          8        400
2018   1         11        500
2015   2         12        150
2016   2         19        350
2017   2         15        250
2018   2          9        450
2015   3         20        100
2016   3          8        220
2017   3         12        310
2018   3         14        350
And I have a list with the labels of the ids:
List = ['Argentina', 'Brazil','Chile']
I want to replace the values of id with the labels from my list. Thanks in advance.
Date         id  variable1  variable2
2015  Argentina         10        200
2016  Argentina         17        300
2017  Argentina          8        400
2018  Argentina         11        500
2015     Brazil         12        150
2016     Brazil         19        350
2017     Brazil         15        250
2018     Brazil          9        450
2015      Chile         20        100
2016      Chile          8        220
2017      Chile         12        310
2018      Chile         14        350
map is the way to go, with enumerate:
d = {k:v for k,v in enumerate(List, start=1)}
df['id'] = df['id'].map(d)
Output:
    Date         id  variable1  variable2
0   2015  Argentina         10        200
1   2016  Argentina         17        300
2   2017  Argentina          8        400
3   2018  Argentina         11        500
4   2015     Brazil         12        150
5   2016     Brazil         19        350
6   2017     Brazil         15        250
7   2018     Brazil          9        450
8   2015      Chile         20        100
9   2016      Chile          8        220
10  2017      Chile         12        310
11  2018      Chile         14        350
Try
df['id'] = df['id'].map({1: 'Argentina', 2: 'Brazil', 3: 'Chile'})
or
df['id'] = df['id'].map({k+1: v for k, v in enumerate(List)})
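For reference, a small self-contained sketch of the mapping approach, using three rows from the question. It also shows one behaviour worth knowing: an id without an entry in the mapping becomes NaN. The id 4 used to show this is a made-up value, and `labels` stands in for the question's `List`.

```python
import pandas as pd

labels = ["Argentina", "Brazil", "Chile"]  # called `List` in the question

# Three rows taken from the question's panel (one per id).
df = pd.DataFrame({
    "Date": [2015, 2015, 2015],
    "id": [1, 2, 3],
    "variable1": [10, 12, 20],
    "variable2": [200, 150, 100],
})

# Position in the list (starting at 1) -> label.
mapping = dict(enumerate(labels, start=1))
df["id"] = df["id"].map(mapping)
print(df)

# An id missing from the mapping (4 is a made-up example) maps to NaN.
unmatched = pd.Series([4]).map(mapping)
```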
Adding columns of different length into pandas dataframe
I have a dataframe detailing money awarded to people over several years:
Name -- Money -- Year
Paul    57.00    2012
Susan   67.00    2012
Gary    54.00    2011
Paul    77.00    2011
Andrea  20.00    2011
Albert  23.00    2011
Hal     26.00    2010
Paul    23.00    2010
From this dataframe, I want to construct a dataframe that details all the money awarded in a single year, for making a boxplot:
2012 -- 2011 -- 2010
57.00   54.00   26.00
67.00   77.00   23.00
        20.00
        23.00
As you can see, this results in columns of different lengths. When I try to do this using pandas, I get the error 'ValueError: Length of values does not match length of index'. I assume this is because I can't add columns of varying length to a dataframe. Can anyone offer some advice on how to proceed? Perhaps I'm approaching this incorrectly? Thanks for any help!
I'd do this in a two-step process: first add a column corresponding to the index in each year using cumcount, and then pivot so that the new column is the index, the years become the columns, and the money column becomes the values:
df["yindex"] = df.groupby("Year").cumcount()
new_df = df.pivot(index="yindex", columns="Year", values="Money")
For example:
>>> df = pd.read_csv("money.txt", sep=r"\s+")
>>> df
     Name  Money  Year
0    Paul     57  2012
1   Susan     67  2012
2    Gary     54  2011
3    Paul     77  2011
4  Andrea     20  2011
5  Albert     23  2011
6     Hal     26  2010
7    Paul     23  2010
>>> df["yindex"] = df.groupby("Year").cumcount()
>>> df
     Name  Money  Year  yindex
0    Paul     57  2012       0
1   Susan     67  2012       1
2    Gary     54  2011       0
3    Paul     77  2011       1
4  Andrea     20  2011       2
5  Albert     23  2011       3
6     Hal     26  2010       0
7    Paul     23  2010       1
>>> df.pivot(index="yindex", columns="Year", values="Money")
Year    2010  2011  2012
yindex
0         26    54    57
1         23    77    67
2        NaN    20   NaN
3        NaN    23   NaN
After which you could get rid of the NaNs if you like, but it depends on whether you want to distinguish between cases like "knowing the value is 0" and "not knowing what the value is":
>>> df.pivot(index="yindex", columns="Year", values="Money").fillna(0)
Year    2010  2011  2012
yindex
0         26    54    57
1         23    77    67
2          0    20     0
3          0    23     0
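Since the stated goal is a boxplot, filling the gaps with 0 would distort the distributions. A possible alternative, sketched here as an assumption about what the plotting step needs, is to keep the NaN padding in the pivoted frame and drop it per year when extracting the values. The frame is built inline from the question's data instead of being read from money.txt.

```python
import pandas as pd

# Data from the question, built inline.
df = pd.DataFrame({
    "Name": ["Paul", "Susan", "Gary", "Paul", "Andrea", "Albert", "Hal", "Paul"],
    "Money": [57.0, 67.0, 54.0, 77.0, 20.0, 23.0, 26.0, 23.0],
    "Year": [2012, 2012, 2011, 2011, 2011, 2011, 2010, 2010],
})

# cumcount numbers the rows within each year; pivot then lines them up side by side.
wide = (
    df.assign(yindex=df.groupby("Year").cumcount())
      .pivot(index="yindex", columns="Year", values="Money")
)

# One list per year with the padding NaNs removed, e.g. as input for a boxplot.
per_year = {int(year): col.dropna().tolist() for year, col in wide.items()}
print(per_year)
```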