I have a dataframe like this
import pandas as pd
year = [2005, 2006, 2007]
A = [4, 5, 7]
B = [3, 3, 9]
C = [1, 7, 6]
df_old = pd.DataFrame({'year' : year, 'A' : A, 'B' : B, 'C' : C})
Out[25]:
A B C year
0 4 3 1 2005
1 5 3 7 2006
2 7 9 6 2007
I want to transform this to a new dataframe where the column headers 'A', 'B' and 'C' are in the rows. I have this hack, which sort of does the job:
df_new = pd.DataFrame({'year': list(df_old['year']) + list(df_old['year'])
                               + list(df_old['year']),
                       'col': ['A'] * len(df_old['A']) + ['B'] * len(df_old['B'])
                              + ['C'] * len(df_old['C']),
                       'val': list(df_old['A']) + list(df_old['B'])
                              + list(df_old['C'])})
Out[27]:
col val year
0 A 4 2005
1 A 5 2006
2 A 7 2007
3 B 3 2005
4 B 3 2006
5 B 9 2007
6 C 1 2005
7 C 7 2006
8 C 6 2007
Is there a better, more compressed way to do this? Needless to say, this becomes cumbersome when there are a lot of columns.
Use melt:
print(df_old.melt('year', value_name='val', var_name='col'))
year col val
0 2005 A 4
1 2006 A 5
2 2007 A 7
3 2005 B 3
4 2006 B 3
5 2007 B 9
6 2005 C 1
7 2006 C 7
8 2007 C 6
and to reorder the columns, use reindex:
df = df_old.melt('year', value_name='val', var_name='col').reindex(columns=['col', 'val', 'year'])
print(df)
col val year
0 A 4 2005
1 A 5 2006
2 A 7 2007
3 B 3 2005
4 B 3 2006
5 B 9 2007
6 C 1 2005
7 C 7 2006
8 C 6 2007
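Alternatively, a minimal stack-based sketch gives the same long format (the rows come out grouped by year rather than by column):

import pandas as pd

df_old = pd.DataFrame({'year': [2005, 2006, 2007],
                       'A': [4, 5, 7], 'B': [3, 3, 9], 'C': [1, 7, 6]})

# move the column labels into the rows by stacking, then flatten the index
df = df_old.set_index('year').stack().reset_index()
df.columns = ['year', 'col', 'val']
print(df)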
I have a dataset of the following form.
id year
0 A 2000
1 A 2001
2 B 2005
3 B 2006
4 B 2007
5 C 2003
6 C 2004
7 D 2002
8 D 2003
Now two or more IDs are assumed to be part of an aggregated ID if they can be arranged in consecutive order. Meaning that in the end I would like to have this grouping, in which A & D form one group and B & C another:
id year match
0 A 2000 1
1 A 2001 1
7 D 2002 1
8 D 2003 1
5 C 2003 2
6 C 2004 2
2 B 2005 2
3 B 2006 2
4 B 2007 2
EDIT: Addressing @Dimitris_ps' comments: Assuming an additional row
id year
9 A 2002
would change the desired result to
id year match
0 A 2000 1
1 A 2001 1
9 A 2002 1
5 C 2003 1
6 C 2004 1
2 B 2005 1
3 B 2006 1
4 B 2007 1
7 D 2002 2
8 D 2003 2
because now there is no longer a consecutive order for A & D but instead for A, C, and B with D having no match.
Recode your id to values and then you can sort based on year and id.
import pandas as pd
df = pd.DataFrame({'id':['A', 'A', 'B', 'B', 'B', 'C', 'C', 'D', 'D'],
'year':[2000, 2001, 2005, 2006, 2007, 2003, 2004, 2002, 2003]}) # example dataframe
# Create a dict mapping id to values based on the minimum year
custom_dict = {el:i for i, el in enumerate(df.groupby('id')['year'].min().sort_values().index)}
# and the reverse to map back the values to the id
custom_dict_rev = {v:k for k, v in custom_dict.items()}
df['id'] = df['id'].map(custom_dict)
df = df.sort_values(['year', 'id'])
df['id'] = df['id'].map(custom_dict_rev)
df
Following up to my previous question here:
import pandas as pd
d = pd.DataFrame({'value': ['a', 'b'], '2019Q1': [1, 5], '2019Q2': [2, 6], '2019Q3': [3, 7]})
which displays like this:
value 2019Q1 2019Q2 2019Q3
0 a 1 2 3
1 b 5 6 7
How can I transform it into this shape:
Year measure Quarter Value
2019 a 1 1
2019 a 2 2
2019 a 3 3
2019 b 1 5
2019 b 2 6
2019 b 3 7
Use pd.wide_to_long with DataFrame.melt:
df2 = d.copy()
df2.columns = d.columns.str.split('Q').str[::-1].str.join('_')
new_df = (pd.wide_to_long(df2.rename(columns={'value': 'Measure'}),
                          ['1', '2', '3'],
                          j='Year',
                          i='Measure',
                          sep='_')
            .reset_index()
            .melt(['Measure', 'Year'], var_name='Quarter', value_name='Value')
            .loc[:, ['Year', 'Measure', 'Quarter', 'Value']]
            .sort_values(['Year', 'Measure', 'Quarter']))
print(new_df)
Year Measure Quarter Value
0 2019 a 1 1
2 2019 a 2 2
4 2019 a 3 3
1 2019 b 1 5
3 2019 b 2 6
5 2019 b 3 7
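To see what the split/reverse/join line does to the headers, here is that transformation in isolation:

import pandas as pd

cols = pd.Index(['value', '2019Q1', '2019Q2', '2019Q3'])
# 'value' has no 'Q', so it passes through unchanged;
# '2019Q1' becomes ['2019', '1'] -> ['1', '2019'] -> '1_2019'
print(cols.str.split('Q').str[::-1].str.join('_'))
# Index(['value', '1_2019', '2_2019', '3_2019'], dtype='object')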
This is just an addition for future visitors: when you split the columns with expand=True, you get a MultiIndex, which allows reshaping with the stack method.
# set the value column as the index
d = d.set_index('value')
# split the columns and convert to a MultiIndex
d.columns = d.columns.str.split('Q', expand=True)
# reshape the dataframe
d.stack([0, 1]).rename_axis(['measure', 'year', 'quarter']).reset_index(name='Value')
measure year quarter Value
0 a 2019 1 1
1 a 2019 2 2
2 a 2019 3 3
3 b 2019 1 5
4 b 2019 2 6
5 b 2019 3 7
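If wide_to_long feels heavy here, a plain melt plus a string split also works. A minimal sketch (the intermediate column name YQ is just a placeholder):

import pandas as pd

d = pd.DataFrame({'value': ['a', 'b'], '2019Q1': [1, 5],
                  '2019Q2': [2, 6], '2019Q3': [3, 7]})

# melt to long form, then split e.g. '2019Q1' into Year and Quarter
long = d.melt(id_vars='value', var_name='YQ', value_name='Value')
long[['Year', 'Quarter']] = long['YQ'].str.split('Q', expand=True).astype(int)
out = (long.rename(columns={'value': 'Measure'})
           [['Year', 'Measure', 'Quarter', 'Value']]
           .sort_values(['Year', 'Measure', 'Quarter']))
print(out)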
I am relatively new to Python and pandas, and I am trying to perform a group-wise operation using apply but am struggling to get it working.
My data frame looks like this:
Year Country Val1 Val2 Fact
2005 A 1 3 1
2006 A 2 4 2
2007 A 3 5 2
2008 A 4 3 1
2009 A 4 3 1
2010 A 4 3 1
2005 B 5 7 2
2006 B 6 6 2
2007 B 7 5 1
2008 B 8 6 2
2009 B 8 6 2
2010 B 8 6 2
For each country in each year, I need to calculate
(country mean for period 2005-2008 - value in 2005)/4 * Fact * (Year - 2005) + value in 2005
So far I have read up on the use of apply and transform and looked at questions related to the use of both functions (e.g. 1 and 2), and I thought that my problem could be solved by a group-wise apply.
I tried to set it up like so:
import pandas as pd
df = pd.DataFrame({'Year' : [2005, 2006, 2007, 2008, 2009, 2010, 2005, 2006, 2007, 2008, 2009, 2010],
'Country' : ['A', 'A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'B'],
'Val1' : [1, 2, 3, 4, 4, 4, 5, 6, 7, 8, 8, 8],
'Val2' : [3, 4, 5, 3, 3, 3, 7, 6, 5, 6, 6, 6,],
'Fact' : [1, 2, 2, 1, 1, 1, 2, 2, 1, 2, 2, 2]
})
def func(grp):
    grad = grp[(grp['Year'] > 2004) & (grp['Year'] < 2009)].transform('mean')
    ref = grp[grp['Year'] == 2005]
    grad = (grad - ref)/4
    res = grad * grp['Fact'] * (grp['Year']-2015) * ref
    return res
df.groupby('Country').apply(func)
Running the code yields
Country Fact Val1 Val2 Year 0 1 2 3 4 5 6 7 8 9 10 11
Country
A 0 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
B 6 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
However, I hoped to receive something along the line of this
Year Country Val1 Val2 Fact
2005 A 1 3 1
2006 A 1.75 3.375 2
2007 A 2.5 3.75 2
2008 A 2.125 3.5625 1
2009 A 2.125 3.5625 1
2010 A 2.125 3.5625 1
2005 B 5 7 2
2006 B 5.75 6.5 2
2007 B 5.75 6.5 1
2008 B 7.25 5.5 2
2009 B 7.25 5.5 2
2010 B 7.25 5.5 2
I would be very grateful if anybody could point me towards a solution for this problem.
It is better not to do this within one function:
# s1: per-country mean of Val1/Val2 over 2005-2008
s1 = df.loc[df.Year.between(2005, 2008)].groupby('Country').mean()[['Val1', 'Val2']]
# s2: the 2005 baseline values, indexed by country
s2 = df.loc[df.Year.eq(2005), ['Country', 'Val1', 'Val2']].set_index('Country')
# s3: the per-row multiplier Fact * (Year - 2005)
s3 = df.Year.sub(2005) * df.Fact
# broadcast the per-country gradient over the rows, then add the baseline back
s = (s1 - s2).div(4).reindex(df.Country).values * s3.values[:, None] + s2.reindex(df.Country).values
df.loc[:, ['Val1', 'Val2']] = s
df
Year Country Val1 Val2 Fact
0 2005 A 1.000 3.0000 1
1 2006 A 1.750 3.3750 2
2 2007 A 2.500 3.7500 2
3 2008 A 2.125 3.5625 1
4 2009 A 2.500 3.7500 1
5 2010 A 2.875 3.9375 1
6 2005 B 5.000 7.0000 2
7 2006 B 5.750 6.5000 2
8 2007 B 5.750 6.5000 1
9 2008 B 7.250 5.5000 2
10 2009 B 8.000 5.0000 2
11 2010 B 8.750 4.5000 2
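For reference, the original groupby/apply approach can be repaired as well: in func above, Year - 2015 should be Year - 2005, and per the formula the result must be added to the 2005 baseline rather than multiplied by it. A sketch along those lines (starting from a fresh df as defined in the question):

def func(grp):
    grp = grp.copy()
    # per-country mean of Val1/Val2 over 2005-2008
    mean_0508 = grp.loc[grp['Year'].between(2005, 2008), ['Val1', 'Val2']].mean()
    # the 2005 baseline row for this country
    ref = grp.loc[grp['Year'] == 2005, ['Val1', 'Val2']].iloc[0]
    grad = (mean_0508 - ref) / 4
    factor = grp['Fact'] * (grp['Year'] - 2005)
    # broadcast: (n,) factor times (2,) gradient, plus the (2,) baseline
    grp[['Val1', 'Val2']] = grad.values * factor.values[:, None] + ref.values
    return grp

df.groupby('Country', group_keys=False).apply(func)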
I have a dataframe:
df = pd.DataFrame([[2, 4, 7, 8, 1, 3, 2013], [9, 2, 4, 5, 5, 6, 2014]], columns=['Amy', 'Bob', 'Carl', 'Chris', 'Ben', 'Other', 'Year'])
Amy Bob Carl Chris Ben Other Year
0 2 4 7 8 1 3 2013
1 9 2 4 5 5 6 2014
And a dictionary:
d = {'A': ['Amy'], 'B': ['Bob', 'Ben'], 'C': ['Carl', 'Chris']}
I would like to reshape my dataframe to look like this:
Group Name Year Value
0 A Amy 2013 2
1 A Amy 2014 9
2 B Bob 2013 4
3 B Bob 2014 2
4 B Ben 2013 1
5 B Ben 2014 5
6 C Carl 2013 7
7 C Carl 2014 4
8 C Chris 2013 8
9 C Chris 2014 5
10 Other 2013 3
11 Other 2014 6
Note that Other doesn't have any values in the Name column and the order of the rows does not matter. I think I should be using the melt function but the examples that I've come across aren't too clear.
melt gets you part way there.
In [29]: m = pd.melt(df, id_vars=['Year'], var_name='Name')
This has everything except Group. To get that, we need to reshape d a bit as well.
In [30]: d2 = {}
In [31]: for k, v in d.items():
   ....:     for item in v:
   ....:         d2[item] = k
   ....:
In [32]: d2
Out[32]: {'Amy': 'A', 'Ben': 'B', 'Bob': 'B', 'Carl': 'C', 'Chris': 'C'}
In [34]: m['Group'] = m['Name'].map(d2)
In [35]: m
Out[35]:
Year Name value Group
0 2013 Amy 2 A
1 2014 Amy 9 A
2 2013 Bob 4 B
3 2014 Bob 2 B
4 2013 Carl 7 C
.. ... ... ... ...
7 2014 Chris 5 C
8 2013 Ben 1 B
9 2014 Ben 5 B
10 2013 Other 3 NaN
11 2014 Other 6 NaN
[12 rows x 4 columns]
And moving 'Other' from Name to Group
In [8]: mask = m['Name'] == 'Other'
In [9]: m.loc[mask, 'Name'] = ''
In [10]: m.loc[mask, 'Group'] = 'Other'
In [11]: m
Out[11]:
Year Name value Group
0 2013 Amy 2 A
1 2014 Amy 9 A
2 2013 Bob 4 B
3 2014 Bob 2 B
4 2013 Carl 7 C
.. ... ... ... ...
7 2014 Chris 5 C
8 2013 Ben 1 B
9 2014 Ben 5 B
10 2013 3 Other
11 2014 6 Other
[12 rows x 4 columns]
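As an aside, the inversion of d into d2 above can also be written as a single dict comprehension:

d = {'A': ['Amy'], 'B': ['Bob', 'Ben'], 'C': ['Carl', 'Chris']}
d2 = {name: group for group, names in d.items() for name in names}
# {'Amy': 'A', 'Bob': 'B', 'Ben': 'B', 'Carl': 'C', 'Chris': 'C'}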
Pandas melt function:
This function is useful to massage a DataFrame into a format where one or more columns are identifier variables (id_vars), while all other columns, considered measured variables (value_vars), are "unpivoted" to the row axis, leaving just two non-identifier columns, 'variable' and 'value'.
For example:
melted = pd.melt(df, id_vars=["weekday"],
                 var_name="Person", value_name="Score")
We use melt to transform wide data to long data.
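The snippet above assumes a df with a weekday column; a self-contained version of that example might look like this (the data is made up for illustration):

import pandas as pd

df = pd.DataFrame({'weekday': ['Mon', 'Tue'],
                   'Alice': [1, 2],
                   'Bob': [3, 4]})
melted = pd.melt(df, id_vars=['weekday'],
                 var_name='Person', value_name='Score')
print(melted)
#   weekday Person  Score
# 0     Mon  Alice      1
# 1     Tue  Alice      2
# 2     Mon    Bob      3
# 3     Tue    Bob      4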
I'm trying to get all records where the mean of the last 3 rows is greater than the overall mean for all rows in a filtered set.
_filtered_d_all = _filtered_d.iloc[:, 0:50].loc[:, _filtered_d.mean()>0.05]
_last_n_records = _filtered_d.tail(3)
Something like this
_filtered_growing = _filtered_d.iloc[:, 0:50].loc[:, _last_n_records.mean() > _filtered_d.mean()]
However, the problem here is that the value length is incorrect. Any tips?
ValueError: Series lengths must match to compare
Sample Data
This has an index on the year and month, and 2 columns.
Col1 Col2
year month
2005 12 0.533835 0.170679
12 0.494733 0.198347
2006 3 0.440098 0.202240
6 0.410285 0.188421
9 0.502420 0.200188
12 0.522253 0.118680
2007 3 0.378120 0.171192
6 0.431989 0.145158
9 0.612036 0.178097
12 0.519766 0.252196
2008 3 0.547705 0.202163
6 0.560985 0.238591
9 0.617320 0.199537
12 0.343939 0.253855
Why not just boolean index directly on your filtered DataFrame with
df[df.tail(3).mean() > df.mean()]
Demo
>>> df
0 1 2 3 4
0 4 8 2 4 6
1 0 0 0 2 8
2 5 3 0 9 3
3 7 5 5 1 2
4 9 7 8 9 4
>>> df[df.tail(3).mean() > df.mean()]
0 1 2 3 4
0 4 8 2 4 6
1 0 0 0 2 8
2 5 3 0 9 3
3 7 5 5 1 2
Update: example for the MultiIndex edit
The same should work fine for your MultiIndex sample; we just have to mask a bit differently, of course.
>>> df
col1 col2
2005 12 -0.340088 -0.574140
12 -0.814014 0.430580
2006 3 0.464008 0.438494
6 0.019508 -0.635128
9 0.622645 -0.824526
12 -1.674920 -1.027275
2007 3 0.397133 0.659467
6 0.026170 -0.052063
9 0.835561 0.608067
12 0.736873 -0.613877
2008 3 0.344781 -0.566392
6 -0.653290 -0.264992
9 0.080592 -0.548189
12 0.585642 1.149779
>>> df.loc[:,df.tail(3).mean() > df.mean()]
col2
2005 12 -0.574140
12 0.430580
2006 3 0.438494
6 -0.635128
9 -0.824526
12 -1.027275
2007 3 0.659467
6 -0.052063
9 0.608067
12 -0.613877
2008 3 -0.566392
6 -0.264992
9 -0.548189
12 1.149779
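Applied to the question's setup, the same idea would read (a sketch, reusing the question's _filtered_d name):

# keep only the columns whose mean over the last 3 rows
# exceeds their mean over all rows
_filtered_growing = _filtered_d.loc[:, _filtered_d.tail(3).mean() > _filtered_d.mean()]

Both means are Series indexed by the column labels, so the comparison aligns label-by-label and .loc[:, mask] selects columns, which avoids the length mismatch behind the ValueError.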