Python: pivot dataset

I have a data frame which looks like:
import numpy as np
import pandas as pd

df = pd.DataFrame({'Dev':[1,2,3,4,5,6,7,8,9,10,11,12],'2012':[1,2,3,4,5,6,7,8,9,10,11,12],
                   'GWP':[0,0,0,10,20,30,40,50,60,70,80,90],'Inc':[0,0,0,10,20,30,40,50,60,70,80,90],
                   'Dev1':[1,2,3,4,5,6,7,8,9,10,np.nan,np.nan],'2013':[1,2,3,4,5,6,7,8,9,10,np.nan,np.nan],
                   'GWP1':[0,0,0,10,20,30,40,50,60,70,np.nan,np.nan],'Inc1':[0,0,0,10,20,30,40,50,60,70,np.nan,np.nan],
                   'Dev2':[1,2,3,4,5,6,7,8,np.nan,np.nan,np.nan,np.nan],'2014':[1,2,3,4,5,6,7,8,np.nan,np.nan,np.nan,np.nan],
                   'GWP2':[0,0,0,10,20,30,40,50,np.nan,np.nan,np.nan,np.nan],'Inc2':[0,0,0,10,20,30,40,50,np.nan,np.nan,np.nan,np.nan],
                   })
df.head()
Dev 2012 GWP Inc Dev1 2013 GWP1 Inc1 Dev2 2014 GWP2 Inc2
0 1 1 0 0 1.0 1.0 0.0 0.0 1.0 1.0 0.0 0.0
1 2 2 0 0 2.0 2.0 0.0 0.0 2.0 2.0 0.0 0.0
2 3 3 0 0 3.0 3.0 0.0 0.0 3.0 3.0 0.0 0.0
3 4 4 10 10 4.0 4.0 10.0 10.0 4.0 4.0 10.0 10.0
4 5 5 20 20 5.0 5.0 20.0 20.0 5.0 5.0 20.0 20.0
I'm trying to pivot this dataframe to the following:
result_df = pd.DataFrame({'Dev':list(np.arange(1,13))*3,'YEAR':[2012]*12 + [2013]*12 + [2014]*12,
'GWP':[0,0,0,10,20,30,40,50,60,70,80,90] + [0,0,0,10,20,30,40,50,60,70,np.nan,np.nan] + [0,0,0,10,20,30,40,50,np.nan,np.nan,np.nan,np.nan],
'Inc':[0,0,0,10,20,30,40,50,60,70,80,90] + [0,0,0,10,20,30,40,50,60,70,np.nan,np.nan] + [0,0,0,10,20,30,40,50,np.nan,np.nan,np.nan,np.nan]})
result_df.head()
Out[83]:
Dev YEAR GWP Inc
0 1 2012 0.0 0.0
1 2 2012 0.0 0.0
2 3 2012 0.0 0.0
3 4 2012 10.0 10.0
4 5 2012 20.0 20.0
Does anyone know how this is possible using pandas or R?

Consider melt and wide_to_long. Specifically, melt the year columns, 2012-2014, then rename the remaining columns to follow a stub-suffix naming style. Finally, reshape across the multiple column groups according to the stubs Dev, GWP, and Inc:
melt_df = (df.melt(id_vars=df.columns[~df.columns.isin(['2012', '2013', '2014'])],
                   value_vars=['2012', '2013', '2014'], var_name='Year')
             .drop(columns=['value'])
             .rename(columns={'GWP': 'GWP0', 'Inc': 'Inc0', 'Dev': 'Dev0'})
           )

final_df = pd.wide_to_long(melt_df.assign(id=lambda x: x.index),
                           ["Dev", "GWP", "Inc"], i="id", j="suffix")
print(final_df.head(20))
# Year GWP Inc Dev
# id suffix
# 0 0 2012 0.0 0.0 1.0
# 1 0 2012 0.0 0.0 2.0
# 2 0 2012 0.0 0.0 3.0
# 3 0 2012 10.0 10.0 4.0
# 4 0 2012 20.0 20.0 5.0
# 5 0 2012 30.0 30.0 6.0
# 6 0 2012 40.0 40.0 7.0
# 7 0 2012 50.0 50.0 8.0
# 8 0 2012 60.0 60.0 9.0
# 9 0 2012 70.0 70.0 10.0
# 10 0 2012 80.0 80.0 11.0
# 11 0 2012 90.0 90.0 12.0
# 12 0 2013 0.0 0.0 1.0
# 13 0 2013 0.0 0.0 2.0
# 14 0 2013 0.0 0.0 3.0
# 15 0 2013 10.0 10.0 4.0
# 16 0 2013 20.0 20.0 5.0
# 17 0 2013 30.0 30.0 6.0
# 18 0 2013 40.0 40.0 7.0
# 19 0 2013 50.0 50.0 8.0
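Note that, as the head(20) above shows, the melt crosses every Year with every suffix, so a final filtering step may still be needed to keep only the rows whose suffix belongs to that year. A minimal sketch, assuming suffix 0 holds the 2012 columns, suffix 1 the 2013 columns, and suffix 2 the 2014 columns:
year_to_suffix = {'2012': '0', '2013': '1', '2014': '2'}
result = (final_df.reset_index()
            .loc[lambda x: x['suffix'].astype(str) == x['Year'].map(year_to_suffix),
                 ['Dev', 'Year', 'GWP', 'Inc']]
            .rename(columns={'Year': 'YEAR'})
            .reset_index(drop=True))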

Related

Relative minimum values in pandas

I have the following dataframe in pandas:
Race_ID Athlete_ID Finish_time
0 1.0 1.0 56.1
1 1.0 3.0 60.2
2 1.0 2.0 57.1
3 1.0 4.0 57.2
4 2.0 2.0 56.2
5 2.0 1.0 56.3
6 2.0 3.0 56.4
7 2.0 4.0 56.5
8 3.0 1.0 61.2
9 3.0 2.0 62.1
10 3.0 3.0 60.4
11 3.0 4.0 60.0
12 4.0 2.0 55.0
13 4.0 1.0 54.0
14 4.0 3.0 53.0
15 4.0 4.0 52.0
where Race_ID is in descending order of time (i.e. 1 is the most recent race and 4 is the oldest race).
I want to add a new column Relative_time#t-1, which is the athlete's Finish_time in the last race relative to the fastest time in the last race. Hence the output would look something like
Race_ID Athlete_ID Finish_time Relative_time#t-1
0 1.0 1.0 56.1 56.3/56.2
1 1.0 3.0 60.2 56.4/56.2
2 1.0 2.0 57.1 56.2/56.2
3 1.0 4.0 57.2 56.5/56.2
4 2.0 2.0 56.2 62.1/60
5 2.0 1.0 56.3 61.2/60
6 2.0 3.0 56.4 60.4/60
7 2.0 4.0 56.5 60/60
8 3.0 1.0 61.2 54/52
9 3.0 2.0 62.1 55/52
10 3.0 3.0 60.4 53/52
11 3.0 4.0 60.0 52/52
12 4.0 2.0 55.0 0
13 4.0 1.0 54.0 0
14 4.0 3.0 53.0 0
15 4.0 4.0 52.0 0
Here's the code:
data = [[1,1,56.1,'56.3/56.2'],
[1,3,60.2,'56.4/56.2'],
[1,2,57.1,'56.2/56.2'],
[1,4,57.2,'56.5/56.2'],
[2,2,56.2,'62.1/60'],
[2,1,56.3,'61.2/60'],
[2,3,56.4,'60.4/60'],
[2,4,56.5,'60/60'],
[3,1,61.2,'54/52'],
[3,2,62.1,'55/52'],
[3,3,60.4,'53/52'],
[3,4,60,'52/52'],
[4,2,55,'0'],
[4,1,54,'0'],
[4,3,53,'0'],
[4,4,52,'0']]
df = pd.DataFrame(data,columns=['Race_ID','Athlete_ID','Finish_time','Relative_time#t-1'],dtype=float)
I intentionally made Relative_time#t-1 a str instead of a number to show the formula.
Here is what I have tried:
df.sort_values(by = ['Race_ID', 'Athlete_ID'], ascending=[True, True], inplace=True)
df['Finish_time#t-1'] = df.groupby('Athlete_ID')['Finish_time'].shift(-1)
df['Finish_time#t-1'] = df['Finish_time#t-1'].replace(np.nan, 0, regex = True)
So I get the numerator for the new column but I don't know how to get the minimum time for each Race_ID (i.e. the value in the denominator)
Thank you in advance.
Try this:
(df.groupby('Athlete_ID')['Finish_time']
   .shift(-1)
   .div(df['Race_ID'].map(df.groupby('Race_ID')['Finish_time']
                            .min()
                            .shift(-1)))
   .fillna(0))
Output:
0 1.001779
1 1.003559
2 1.000000
3 1.005338
4 1.035000
5 1.020000
6 1.006667
7 1.000000
8 1.038462
9 1.057692
10 1.019231
11 1.000000
12 0.000000
13 0.000000
14 0.000000
15 0.000000
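To store it under the column name from the question (overwriting the illustrative string column), the same expression can simply be assigned back:
df['Relative_time#t-1'] = (
    df.groupby('Athlete_ID')['Finish_time'].shift(-1)
      .div(df['Race_ID'].map(df.groupby('Race_ID')['Finish_time'].min().shift(-1)))
      .fillna(0)
)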

Pandas df group by count elements

My dataframe looks like this.
# initialize list of lists
data = [[1998, 1998,2002,2003], [2001, 1999,1993,2003], [1998, 1999,2003,1994], [1998,1997,2003,1993], [1999,2001,1996, 1999]]
df = pd.DataFrame(data, columns = ['A', 'B', 'C', 'D'])
I would like to count, for each year, the number of occurrences as a percentage, so that the dataframe looks like this:
1997 1998 1999
A 20% 80% 100%
B 30% 10% 0%
C 70% 10% 0%
I tried to use Pandas group-by.
The logic is not fully clear (it looks like the provided output does not correspond to the provided input), but here are some approaches:
using crosstab
Percent per year
df2 = df.melt(value_name='year')
df2 = pd.crosstab(df2['variable'], df2['year'], normalize='columns').mul(100)
# or
# df2 = pd.crosstab(df2['variable'], df2['year'])
# df2.div(df2.sum()).mul(100)
Output:
year 1993 1994 1996 1997 1998 1999 2001 2002 2003
variable
A 0.0 0.0 0.0 0.0 75.0 25.0 50.0 0.0 0.0
B 0.0 0.0 0.0 100.0 25.0 50.0 50.0 0.0 0.0
C 50.0 0.0 100.0 0.0 0.0 0.0 0.0 100.0 50.0
D 50.0 100.0 0.0 0.0 0.0 25.0 0.0 0.0 50.0
Percent per variable
df2 = df.melt(value_name='year')
pd.crosstab(df2['variable'], df2['year'], normalize='index').mul(100)
# or
# df2 = pd.crosstab(df2['variable'], df2['year'])
# df2.div(df2.sum(1), axis=0).mul(100)
Output:
year 1993 1994 1996 1997 1998 1999 2001 2002 2003
variable
A 0.0 0.0 0.0 0.0 60.0 20.0 20.0 0.0 0.0
B 0.0 0.0 0.0 20.0 20.0 40.0 20.0 0.0 0.0
C 20.0 0.0 20.0 0.0 0.0 0.0 0.0 20.0 40.0
D 20.0 20.0 0.0 0.0 0.0 20.0 0.0 0.0 40.0
using groupby
(df.stack()
.groupby(level=1)
.apply(lambda s: s.value_counts(normalize=True))
.unstack(fill_value=0)
.mul(100)
)
Output:
1993 1994 1996 1997 1998 1999 2001 2002 2003
A 0.0 0.0 0.0 0.0 60.0 20.0 20.0 0.0 0.0
B 0.0 0.0 0.0 20.0 20.0 40.0 20.0 0.0 0.0
C 20.0 0.0 20.0 0.0 0.0 0.0 0.0 20.0 40.0
D 20.0 20.0 0.0 0.0 0.0 20.0 0.0 0.0 40.0
Another option could be the following:
# getting value_counts for each column
df2 = pd.concat([df[col].value_counts(normalize=True).rename(col) for col in df.columns], axis=1)
# filling null values with 0 and sorting the year index
df2 = df2.fillna(0).sort_index()
# converting to percent, then to string with a % sign
df2 = (df2 * 100).astype('int').astype('str') + '%'
# getting your output (the year labels are integers here)
df2.loc[1997:1999, 'A':'C'].T
Output (formatted like the question's desired table; as noted above, those numbers do not follow from the sample input):
1997 1998 1999
A 20% 80% 100%
B 30% 10% 0%
C 70% 10% 0%
melt + groupby + unstack
(df.melt().groupby(['variable', 'value']).size()
/ df.melt().groupby('value').size()).unstack(1)
Out[1]:
value 1993 1994 1996 1997 1998 1999 2001 2002 2003
variable
A NaN NaN NaN NaN 0.75 0.25 0.5 NaN NaN
B NaN NaN NaN 1.0 0.25 0.50 0.5 NaN NaN
C 0.5 NaN 1.0 NaN NaN NaN NaN 1.0 0.5
D 0.5 1.0 NaN NaN NaN 0.25 NaN NaN 0.5
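To get percentages with zeros instead of NaN (matching the crosstab output above), the same expression can be extended, for example:
(df.melt().groupby(['variable', 'value']).size()
 / df.melt().groupby('value').size()).unstack(1).fillna(0).mul(100)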

Store the result in new columns, named based on another variable (Pandas)

I have a dataframe. What I need is to calculate the difference between the variables A and B and store the result in new columns based on the variable df['Value']. If Value == 1, then the result is stored in a column named Diff_1; if Value == 2, then in Diff_2, and so on.
Here is the code so far, but obviously the line df_red['Diff_' + str(value) ] = df_red['A'] - df_red['B'] is not doing what I want:
import pandas as pd
df = pd.read_excel(r'E:\...\.xlsx')
print(df)
value = list(set(df['Value']))
print(value)
for value in value:
    df_red = df[df['Value'] == value]
    df_red['Diff_' + str(value)] = df_red['A'] - df_red['B']
Out[126]:
ID Value A B
0 1 1 56.0 49.0
1 2 3 56.0 50.0
2 3 4 103.0 44.0
3 4 2 89.0 44.0
4 5 1 84.0 41.0
5 6 1 77.0 43.0
6 7 2 71.0 35.0
7 8 4 77.0 32.0
print(value)
[1, 2, 3, 4]
After a simple operation of df['A'] - df['B'] the result should look like this.
Out[128]:
ID Value A B Diff_1 Diff_2 Diff_3 Diff_4
0 1 1 56.0 49.0 7.0 0.0 0.0 0.0
1 2 3 56.0 50.0 0.0 0.0 6.0 0.0
2 3 4 103.0 44.0 0.0 0.0 0.0 60.0
3 4 2 89.0 44.0 0.0 45.0 0.0 0.0
4 5 1 84.0 41.0 43.0 0.0 0.0 0.0
5 6 1 77.0 43.0 34.0 0.0 0.0 0.0
6 7 2 71.0 35.0 0.0 36.0 0.0 0.0
7 8 4 77.0 32.0 0.0 0.0 0.0 45.0
A not-so-great way of doing this would be the following; however, I am looking for more efficient, better ways:
df['Diff_1'] = df[df['Value']==1]['A'] - df[df['Value']==1]['B']
df['Diff_2'] = df[df['Value']==2]['A'] - df[df['Value']==2]['B']
df['Diff_3'] = df[df['Value']==3]['A'] - df[df['Value']==3]['B']
df['Diff_4'] = df[df['Value']==4]['A'] - df[df['Value']==4]['B']
You can use:
df.join(df.set_index(['ID', 'Value'])
          .eval('A-B')
          .unstack(level=1)
          .add_prefix('Diff_')
          .reset_index(drop=True)
        )
Output:
ID Value A B Diff_1 Diff_2 Diff_3 Diff_4
0 1 1 56.0 49.0 7.0 NaN NaN NaN
1 2 3 56.0 50.0 NaN NaN 6.0 NaN
2 3 4 103.0 44.0 NaN NaN NaN 59.0
3 4 2 89.0 44.0 NaN 45.0 NaN NaN
4 5 1 84.0 41.0 43.0 NaN NaN NaN
5 6 1 77.0 43.0 34.0 NaN NaN NaN
6 7 2 71.0 35.0 NaN 36.0 NaN NaN
7 8 4 77.0 32.0 NaN NaN NaN 45.0
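To match the zero-filled layout from the question, a fillna(0) can simply be appended to the same expression, e.g.:
df.join(df.set_index(['ID', 'Value'])
          .eval('A-B')
          .unstack(level=1)
          .add_prefix('Diff_')
          .reset_index(drop=True)
        ).fillna(0)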
Here is my approach which may not be fastest, but it's a start:
for i in df['Value'].unique():
    df.loc[df['Value'] == i, 'Diff_' + str(i)] = df['A'] - df['B']

df.fillna(0, inplace=True)
Output of my fake data:
Value A B Diff_1 Diff_2 Diff_3 Diff_4
0 1 20 2 18.0 0.0 0.0 0.0
1 1 30 5 25.0 0.0 0.0 0.0
2 2 40 7 0.0 33.0 0.0 0.0
3 2 50 15 0.0 35.0 0.0 0.0
4 3 60 25 0.0 0.0 35.0 0.0
5 3 20 7 0.0 0.0 13.0 0.0
6 4 15 36 0.0 0.0 0.0 -21.0
7 4 14 3 0.0 0.0 0.0 11.0
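For comparison, a dummy-based sketch (just an illustration; it assumes Value is integer-typed and that the zero-filled layout from the question is what you want):
# one indicator column per Value label (Diff_1 ... Diff_4), scaled by the row-wise difference
diff = df['A'] - df['B']
out = df.join(pd.get_dummies(df['Value'], prefix='Diff').mul(diff, axis=0))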

How to sort dataframe rows in pandas with respect to months from Jan to Dec

How can we sort the rows below in a dataframe with respect to month, from Jan to Dec? Currently this dataframe is in alphabetical order.
0 Col1 Col2 Col3 ... Col22 Col23 Col24
1 April 53.0 0.0 ... 11.0 0.0 0.0
2 August 43.0 0.0 ... 11.0 3.0 5.0
3 December 36.0 0.0 ... 4.0 1.0 0.0
4 February 48.0 0.0 ... 16.0 0.0 0.0
5 January 55.0 0.0 ... 24.0 4.0 0.0
6 July 45.0 0.0 ... 4.0 8.0 1.0
7 June 34.0 0.0 ... 4.0 8.0 1.0
8 March 34.0 2.0 ... 24.0 4.0 1.0
9 May 52.0 1.0 ... 3.0 2.0 1.0
10 November 33.0 0.0 ... 7.0 2.0 3.0
11 October 21.0 1.0 ... 7.0 1.0 2.0
12 September 27.0 0.0 ... 5.0 3.0 3.0
We can also use pd.date_range with month_name() and month:
month = pd.date_range(start='2018-01', freq='M', periods=12)
df.loc[df['Col1'].map(dict(zip(month.month_name(),month.month))).sort_values().index]
Col1 Col2 Col3 Col22 Col23 Col24
5 January 55.0 0.0 24.0 4.0 0.0
4 February 48.0 0.0 16.0 0.0 0.0
8 March 34.0 2.0 24.0 4.0 1.0
1 April 53.0 0.0 11.0 0.0 0.0
9 May 52.0 1.0 3.0 2.0 1.0
7 June 34.0 0.0 4.0 8.0 1.0
6 July 45.0 0.0 4.0 8.0 1.0
2 August 43.0 0.0 11.0 3.0 5.0
12 September 27.0 0.0 5.0 3.0 3.0
11 October 21.0 1.0 7.0 1.0 2.0
10 November 33.0 0.0 7.0 2.0 3.0
3 December 36.0 0.0 4.0 1.0 0.0
You can use calendar to create a month-name to month-number mapping, then sort the values and reindex:
import calendar
df.reindex(df['Col1'].map({month: i for i, month in enumerate(calendar.month_name)})
                     .sort_values().index)
Col1 Col2 Col3 ... Col22 Col23 Col24
5 January 55.0 0.0 ... 24.0 4.0 0.0
4 February 48.0 0.0 ... 16.0 0.0 0.0
8 March 34.0 2.0 ... 24.0 4.0 1.0
1 April 53.0 0.0 ... 11.0 0.0 0.0
9 May 52.0 1.0 ... 3.0 2.0 1.0
7 June 34.0 0.0 ... 4.0 8.0 1.0
6 July 45.0 0.0 ... 4.0 8.0 1.0
2 August 43.0 0.0 ... 11.0 3.0 5.0
12 September 27.0 0.0 ... 5.0 3.0 3.0
11 October 21.0 1.0 ... 7.0 1.0 2.0
10 November 33.0 0.0 ... 7.0 2.0 3.0
3 December 36.0 0.0 ... 4.0 1.0 0.0
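Another common pattern (a sketch, not taken from the answers above) is to make Col1 an ordered categorical built from calendar.month_name and sort on it:
import calendar
# months become an ordered categorical, so sort_values follows calendar order
df['Col1'] = pd.Categorical(df['Col1'], categories=list(calendar.month_name)[1:], ordered=True)
df.sort_values('Col1')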

Getting most recent observation & date from several columns

Take the following toy DataFrame:
data = np.arange(35, dtype=np.float32).reshape(7, 5)
data = pd.concat((
    pd.DataFrame(list('abcdefg'), columns=['field1']),
    pd.DataFrame(data, columns=['field2', '2014', '2015', '2016', '2017'])),
    axis=1)
data.iloc[1:4, 4:] = np.nan
data.iloc[4, 3:] = np.nan
print(data)
field1 field2 2014 2015 2016 2017
0 a 0.0 1.0 2.0 3.0 4.0
1 b 5.0 6.0 7.0 NaN NaN
2 c 10.0 11.0 12.0 NaN NaN
3 d 15.0 16.0 17.0 NaN NaN
4 e 20.0 21.0 NaN NaN NaN
5 f 25.0 26.0 27.0 28.0 29.0
6 g 30.0 31.0 32.0 33.0 34.0
I'd like to replace the "year" columns (2014-2017) with two fields: the most recent non-null observation, and the corresponding year of that observation. Assume field1 is a unique key. (I'm not looking to do any groupby ops, just 1 row per record.) I.e.:
field1 field2 obs date
0 a 0.0 4.0 2017
1 b 5.0 7.0 2015
2 c 10.0 12.0 2015
3 d 15.0 17.0 2015
4 e 20.0 21.0 2014
5 f 25.0 29.0 2017
6 g 30.0 34.0 2017
I've gotten this far:
pd.melt(data, id_vars=['field1', 'field2'],
        value_vars=['2014', '2015', '2016', '2017'])\
  .dropna(subset=['value'])
field1 field2 variable value
0 a 0.0 2014 1.0
1 b 5.0 2014 6.0
2 c 10.0 2014 11.0
3 d 15.0 2014 16.0
4 e 20.0 2014 21.0
5 f 25.0 2014 26.0
6 g 30.0 2014 31.0
# ...
But I am struggling with how to pivot back to the desired format.
Maybe:
d2 = data.melt(id_vars=["field1", "field2"], var_name="date", value_name="obs").dropna(subset=["obs"])
d2["date"] = d2["date"].astype(int)
df = d2.loc[d2.groupby(["field1", "field2"])["date"].idxmax()]
which gives me
field1 field2 date obs
21 a 0.0 2017 4.0
8 b 5.0 2015 7.0
9 c 10.0 2015 12.0
10 d 15.0 2015 17.0
4 e 20.0 2014 21.0
26 f 25.0 2017 29.0
27 g 30.0 2017 34.0
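If the column order and row order from the desired output matter, a small follow-up on the same d2 could be:
out = (d2.loc[d2.groupby('field1')['date'].idxmax(), ['field1', 'field2', 'obs', 'date']]
         .sort_values('field1')
         .reset_index(drop=True))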
What about the following approach:
In [160]: df
Out[160]:
field1 field2 2014 2015 2016 2017
0 a 0.0 1.0 2.0 3.0 -10.0
1 b 5.0 6.0 7.0 NaN NaN
2 c 10.0 11.0 12.0 NaN NaN
3 d 15.0 16.0 17.0 NaN NaN
4 e 20.0 21.0 NaN NaN NaN
5 f 25.0 26.0 27.0 28.0 29.0
6 g 30.0 31.0 32.0 33.0 34.0
In [180]: df.groupby(lambda x: 'obs' if x.isdigit() else x, axis=1) \
     ...:   .last() \
     ...:   .assign(date=df.filter(regex=r'^\d{4}').loc[:, ::-1].notnull().idxmax(1))
Out[180]:
field1 field2 obs date
0 a 0.0 -10.0 2017
1 b 5.0 7.0 2015
2 c 10.0 12.0 2015
3 d 15.0 17.0 2015
4 e 20.0 21.0 2014
5 f 25.0 29.0 2017
6 g 30.0 34.0 2017
last_valid_index + agg('last')
A = data.iloc[:, 2:].apply(lambda x: x.last_valid_index(), axis=1)
B = data.groupby(['value'] * data.shape[1], axis=1).agg('last')
data['date'] = A
data['obs'] = B
data
Out[1326]:
field1 field2 2014 2015 2016 2017 date obs
0 a 0.0 1.0 2.0 3.0 4.0 2017 4.0
1 b 5.0 6.0 7.0 NaN NaN 2015 7.0
2 c 10.0 11.0 12.0 NaN NaN 2015 12.0
3 d 15.0 16.0 17.0 NaN NaN 2015 17.0
4 e 20.0 21.0 NaN NaN NaN 2014 21.0
5 f 25.0 26.0 27.0 28.0 29.0 2017 29.0
6 g 30.0 31.0 32.0 33.0 34.0 2017 34.0
By using assign we can push them into one line as below:
data.assign(date=data.iloc[:,2:].apply(lambda x : x.last_valid_index(),1),obs=data.groupby(['value'] * data.shape[1], 1).agg('last'))
Out[1340]:
field1 field2 2014 2015 2016 2017 date obs
0 a 0.0 1.0 2.0 3.0 4.0 2017 4.0
1 b 5.0 6.0 7.0 NaN NaN 2015 7.0
2 c 10.0 11.0 12.0 NaN NaN 2015 12.0
3 d 15.0 16.0 17.0 NaN NaN 2015 17.0
4 e 20.0 21.0 NaN NaN NaN 2014 21.0
5 f 25.0 26.0 27.0 28.0 29.0 2017 29.0
6 g 30.0 31.0 32.0 33.0 34.0 2017 34.0
Also another possibility by using sort_values and drop_duplicates:
data.melt(id_vars=["field1", "field2"], var_name="date",
          value_name="obs")\
    .dropna(subset=['obs'])\
    .sort_values(['field1', 'date'], ascending=[True, False])\
    .drop_duplicates('field1', keep='first')
which gives you
field1 field2 date obs
21 a 0.0 2017 4.0
8 b 5.0 2015 7.0
9 c 10.0 2015 12.0
10 d 15.0 2015 17.0
4 e 20.0 2014 21.0
26 f 25.0 2017 29.0
27 g 30.0 2017 34.0
