Following up on my previous question here:
import pandas as pd
d = pd.DataFrame({'value':['a', 'b'],'2019Q1':[1, 5], '2019Q2':[2, 6], '2019Q3':[3, 7]})
which displays like this:
value 2019Q1 2019Q2 2019Q3
0 a 1 2 3
1 b 5 6 7
How can I transform it into this shape:
Year measure Quarter Value
2019 a 1 1
2019 a 2 2
2019 a 3 3
2019 b 1 5
2019 b 2 6
2019 b 3 7
Use pd.wide_to_long with DataFrame.melt:
# reverse each column name around 'Q', e.g. '2019Q1' -> '1_2019',
# so wide_to_long can parse stub '1' and suffix '2019' via sep='_'
df2 = d.copy()
df2.columns = d.columns.str.split('Q').str[::-1].str.join('_')

new_df = (pd.wide_to_long(df2.rename(columns={'value': 'Measure'}),
                          ['1', '2', '3'],
                          j='Year',
                          i='Measure',
                          sep='_')
            .reset_index()
            .melt(['Measure', 'Year'], var_name='Quarter', value_name='Value')
            .loc[:, ['Year', 'Measure', 'Quarter', 'Value']]
            .sort_values(['Year', 'Measure', 'Quarter']))
print(new_df)
Year Measure Quarter Value
0 2019 a 1 1
2 2019 a 2 2
4 2019 a 3 3
1 2019 b 1 5
3 2019 b 2 6
5 2019 b 3 7
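For comparison, a more direct sketch (my addition, not part of the original answer): melt everything in one go and split the former column names afterwards. Note that Year and Quarter come out as strings here; cast with astype(int) if the numeric dtype matters.
out = (d.melt(id_vars='value', var_name='period', value_name='Value')
         .rename(columns={'value': 'Measure'}))
# '2019Q1' -> Year '2019', Quarter '1'
out[['Year', 'Quarter']] = out.pop('period').str.split('Q', expand=True)
out = (out[['Year', 'Measure', 'Quarter', 'Value']]
         .sort_values(['Year', 'Measure', 'Quarter']))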
Just an addition for future visitors: when you split the columns and use expand=True, you get a MultiIndex. This allows reshaping with the stack method.
# set the value column as index
d = d.set_index('value')
# split the columns and convert to a MultiIndex
d.columns = d.columns.str.split('Q', expand=True)
# reshape the dataframe
d.stack([0, 1]).rename_axis(['measure', 'year', 'quarter']).reset_index(name='Value')
measure year quarter Value
0 a 2019 1 1
1 a 2019 2 2
2 a 2019 3 3
3 b 2019 1 5
4 b 2019 2 6
5 b 2019 3 7
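A version note (my addition): recent pandas releases deprecate the legacy implementation of multi-level stack in favour of future_stack=True (which becomes the default behaviour in pandas 3). A sketch of the equivalent call, assuming pandas >= 2.1:
(d.stack([0, 1], future_stack=True)
   .rename_axis(['measure', 'year', 'quarter'])
   .reset_index(name='Value'))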
I have a df similar to the one below. I need to select the rows where df['Year 2'] is equal to or closest to df['Year'] within subsets grouped by df['ID'], so in this example rows 1, 2 and 5.
df
Year ID A Year 2 C
0 2020 12 0 2019 0
1 2020 12 0 2020 0 <-
2 2017 10 1 2017 0 <-
3 2017 10 0 2018 0
4 2019 6 0 2017 0
5 2019 6 1 2018 0 <-
I am trying to achieve that with the following piece of code, using groupby and passing a function to get the proper row with the closest value across both columns.
df1 = df.groupby(['ID']).apply(min(df['Year 2'], key=lambda x:abs(x-df['Year'].min())))
This particular line raises TypeError: 'int' object is not callable. Any ideas on how to fix this line of code, or a fresh approach to the problem, are appreciated.
Thanks in advance.
You can subtract the two columns with Series.sub, take the absolute value, and then aggregate the indices of the minimum values per group with DataFrameGroupBy.idxmin:
idx = df['Year 2'].sub(df['Year']).abs().groupby(df['ID']).idxmin()
If you need a new column filled with booleans, use Index.isin:
df['new'] = df.index.isin(idx)
print(df)
Year ID A Year 2 C new
0 2020 12 0 2019 0 False
1 2020 12 0 2020 0 True
2 2017 10 1 2017 0 True
3 2017 10 0 2018 0 False
4 2019 6 0 2017 0 False
5 2019 6 1 2018 0 True
If you need to filter the rows, use DataFrame.loc:
df1 = df.loc[idx]
print(df1)
Year ID A Year 2 C
5 2019 6 1 2018 0
2 2017 10 1 2017 0
1 2020 12 0 2020 0
One-line solution:
df1 = df.loc[df['Year 2'].sub(df['Year']).abs().groupby(df['ID']).idxmin()]
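As for the error in the question: min(df['Year 2'], key=...) is evaluated immediately and returns an int, and GroupBy.apply then tries to call that int, hence 'int' object is not callable. A callable has to be passed instead; a working sketch (generally slower than the idxmin approach above):
df1 = (df.groupby('ID', group_keys=False)
         .apply(lambda g: g.loc[[g['Year 2'].sub(g['Year']).abs().idxmin()]]))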
You could get the idxmin per group:
idx = (df['Year 2']-df['Year']).abs().groupby(df['ID']).idxmin()
# assignment for test
df.loc[idx, 'D'] = '<-'
for selection only:
df2 = df.loc[idx]
output:
Year ID A Year 2 C D
0 2020 12 0 2019 0 NaN
1 2020 12 0 2020 0 <-
2 2017 10 1 2017 0 <-
3 2017 10 0 2018 0 NaN
4 2019 6 0 2017 0 NaN
5 2019 6 1 2018 0 <-
Note that there is a difference between:
df.loc[df.index.isin(idx)]
which keeps the selected rows in their original order, and:
df.loc[idx]
which returns them in the order of idx, i.e. in group order, as in the output above.
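A quick way to see the ordering difference with the data above (my addition):
print(df.loc[idx].index.tolist())                  # [5, 2, 1] -> ID group order
print(df.loc[df.index.isin(idx)].index.tolist())   # [1, 2, 5] -> original order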
I have a dataframe
A B Value FY
1 5 a 2020
2 6 b 2020
3 7 c 2021
4 8 d 2021
I want to create a column 'prev_FY' that looks at the 'Value' column of the previous year and populates it into the current year's row;
my desired output is:
A B Value FY prev_FY
1 5 a 2020
2 6 b 2020
3 7 c 2021 a
4 8 d 2021 b
I tried using pivot_table but it does not work, as the values remain aligned to their own FY. The shift function is not feasible, as I have millions of rows.
Use:
# number rows within each FY so that rows can be matched by position
df['g'] = df.groupby('FY').cumcount()
# build a lookup shifted one year forward, then merge it back on (FY, g)
df2 = df[['FY', 'Value', 'g']].assign(FY=df['FY'].add(1))
df = df.merge(df2, on=['FY', 'g'], how='left', suffixes=('', '_prev')).drop('g', axis=1)
print(df)
A B Value FY Value_prev
0 1 5 a 2020 NaN
1 2 6 b 2020 NaN
2 3 7 c 2021 a
3 4 8 d 2021 b
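An alternative sketch (my addition, starting again from the original four-column frame, with the same positional-alignment assumption as the merge): build a lookup keyed by (FY, position) and map the previous year's value directly:
pos = df.groupby('FY').cumcount()
lookup = dict(zip(zip(df['FY'], pos), df['Value']))
df['prev_FY'] = [lookup.get((fy - 1, p)) for fy, p in zip(df['FY'], pos)]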
I have unique IDs and time-series data, where the time-series data contains 3 macro variables.
I want to construct a data frame whose columns are variable/date combinations and whose rows are identical for every ID. Here is an example of the initial and expected outputs.
The number of IDs is not important here.
Setup
Recreate OP's dataframe
dat = [[3, 4, 1], [4, 5, 3]]
idx = [2017, 2018]
col = ['A', 'B', 'C']
df = pd.DataFrame(dat, idx, col).rename_axis('time')
pd.concat
I wrap enumerate in dict, where enumerate starts from 1 to match OP's ID that starts from 1.
new = pd.concat(dict(enumerate([df] * 3, 1)), names=['ID']).unstack()
new.columns = [f'{x}{y}' for x, y in new.columns]
new
A2017 A2018 B2017 B2018 C2017 C2018
ID
1 3 4 4 5 1 3
2 3 4 4 5 1 3
3 3 4 4 5 1 3
Details
To see what the concatenated dataframe looks like
pd.concat(dict(enumerate([df] * 3, 1)), names=['ID'])
A B C
ID time
1 2017 3 4 1
2018 4 5 3
2 2017 3 4 1
2018 4 5 3
3 2017 3 4 1
2018 4 5 3
If we unstack it
A B C
time 2017 2018 2017 2018 2017 2018
ID
1 3 4 4 5 1 3
2 3 4 4 5 1 3
3 3 4 4 5 1 3
The only thing left to do is to flatten the column levels together, which is what the f-string list comprehension above does.
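For comparison, a sketch of the same construction with NumPy tiling (my addition; n_ids is a hypothetical name for the number of ID rows wanted):
import numpy as np

n_ids = 3
flat = df.unstack()  # Series indexed by (variable, time) pairs
new = pd.DataFrame(np.tile(flat.to_numpy(), (n_ids, 1)),
                   index=pd.RangeIndex(1, n_ids + 1, name='ID'),
                   columns=[f'{v}{t}' for v, t in flat.index])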
I already got an answer to this question in R; I'm wondering how this can be implemented in Python.
Let's say we have a pandas DataFrame like this:
import pandas as pd
d = pd.DataFrame({'2019Q1':[1], '2019Q2':[2], '2019Q3':[3]})
which displays like this:
2019Q1 2019Q2 2019Q3
0 1 2 3
How can I transform it to looks like this:
Year Quarter Value
2019 1 1
2019 2 2
2019 3 3
Use Index.str.split with expand=True to create a MultiIndex from the columns, then reshape with DataFrame.unstack, and finally clean up with Series.rename_axis and Series.reset_index:
d = pd.DataFrame({'2019Q1':[1], '2019Q2':[2], '2019Q3':[3]})
d.columns = d.columns.str.split('Q', expand=True)
df = (d.unstack(0)
       .reset_index(level=2, drop=True)
       .rename_axis(('Year', 'Quarter'))
       .reset_index(name='Value'))
print(df)
Year Quarter Value
0 2019 1 1
1 2019 2 2
2 2019 3 3
Thanks to @Jon Clements for another solution (starting again from the original d with flat columns):
df = (d.melt()
       .variable
       .str.extract(r'(?P<Year>\d{4})Q(?P<Quarter>\d)')
       .assign(Value=d.T.values.flatten()))
print(df)
Year Quarter Value
0 2019 1 1
1 2019 2 2
2 2019 3 3
Alternative with split:
df = (d.melt()
       .variable
       .str.split('Q', expand=True)
       .rename(columns={0: 'Year', 1: 'Quarter'})
       .assign(Value=d.T.values.flatten()))
print(df)
Year Quarter Value
0 2019 1 1
1 2019 2 2
2 2019 3 3
Using DataFrame.stack with DataFrame.pop and Series.str.split:
df = d.stack().reset_index(level=1).rename(columns={0:'Value'})
df[['Year', 'Quarter']] = df.pop('level_1').str.split('Q', expand=True)
Value Year Quarter
0 1 2019 1
0 2 2019 2
0 3 2019 3
If you care about the order of columns, use reindex:
df = df.reindex(['Year', 'Quarter', 'Value'], axis=1)
Year Quarter Value
0 2019 1 1
0 2019 2 2
0 2019 3 3
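One more compact option via the transpose (a sketch, my addition, again starting from the original d):
out = d.T.reset_index()
out[['Year', 'Quarter']] = out.pop('index').str.split('Q', expand=True)
out = out.rename(columns={0: 'Value'})[['Year', 'Quarter', 'Value']]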
I have the following dataframe:
df2 = pd.DataFrame({'season': [1, 1, 1, 2, 2, 2, 3, 3],
                    'value': [-2, 3, 1, 5, 8, 6, 7, 5],
                    'test': [3, 2, 6, 8, 7, 4, 25, 2],
                    'test2': [4, 5, 7, 8, 9, 10, 11, 12]},
                   index=['2020', '2020', '2020', '2020', '2020',
                          '2021', '2021', '2021'])
df2.index= pd.to_datetime(df2.index)
df2.index = df2.index.year
print(df2)
season test test2 value
2020 1 3 4 -2
2020 1 2 5 3
2020 1 6 7 1
2020 2 8 8 5
2020 2 7 9 8
2021 2 4 10 6
2021 3 25 11 7
2021 3 2 12 5
I would like to filter it to obtain, for each year and each season of that year, the maximum value of the column 'value' (keeping the other columns from that row). How can I do that efficiently?
Expected result:
print(df_result)
season value test test2
year
2020 1 3 2 5
2020 2 8 7 9
2021 2 6 4 10
2021 3 7 25 11
Thank you for your help,
Pierre
This is a groupby operation, but a little non-trivial, so posting as an answer.
(df2.set_index('season', append=True)
    .groupby(level=[0, 1])
    .value.max()
    .reset_index(level=1)
)
season value
2020 1 3
2020 2 8
2021 2 6
2021 3 7
You can promote your index to a column, then perform a groupby operation on a list of columns:
df2['year'] = df2.index
df_result = df2.groupby(['year', 'season'])['value'].max().reset_index()
print(df_result)
year season value
0 2020 1 3
1 2020 2 8
2 2021 2 6
3 2021 3 7
If you wish, you can make year your index again via df_result = df_result.set_index('year').
To keep the other columns, use:
df2['year'] = df2.index
df2['value'] = df2.groupby(['year', 'season'])['value'].transform('max')
Then drop any duplicates via pd.DataFrame.drop_duplicates.
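For instance, continuing from the two lines above (note that the non-aggregated columns then come from each group's first row, not necessarily the max-value row):
df_result = df2.drop_duplicates(['year', 'season'])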
Update #1
For your new requirement, you need to apply an aggregation function to two series:
df2['year'] = df2.index
df_result = df2.groupby(['year', 'season'])\
               .agg({'value': 'max', 'test': 'last'})\
               .reset_index()
print(df_result)
year season value test
0 2020 1 3 6
1 2020 2 8 7
2 2021 2 6 4
3 2021 3 7 2
Update #2
For your finalised requirement:
df2['year'] = df2.index
df2['max_value'] = df2.groupby(['year', 'season'])['value'].transform('max')
df_result = (df2.loc[df2['value'] == df2['max_value']]
                .drop_duplicates(['year', 'season'])
                .drop(columns='max_value'))
print(df_result)
season value test test2 year
2020 1 3 2 5 2020
2020 2 8 7 9 2020
2021 2 6 4 10 2021
2021 3 7 25 11 2021
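An equivalent one-step sketch with idxmax (my addition), starting again from the original df2; the year index has duplicate labels, so move it into a column and reset to a unique index first:
tmp = df2.assign(year=df2.index).reset_index(drop=True)
df_result = tmp.loc[tmp.groupby(['year', 'season'])['value'].idxmax()]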
You can use get_level_values to bring the index values into the groupby:
df2.groupby([df2.index.get_level_values(0),df2.season]).value.max().reset_index(level=1)
Out[38]:
season value
2020 1 3
2020 2 8
2021 2 6
2021 3 7
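A variant without materialising a year column, using pd.Grouper on the index level (a sketch, my addition):
df2.groupby([pd.Grouper(level=0), 'season'])['value'].max()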