Transforming pandas data frame using stack function - python

I have the following pandas DataFrame:
import pandas as pd
import numpy as np
np.random.seed(1)
N = 5
data = pd.DataFrame(np.random.rand(N, 3), columns=['Monday', 'Wednesday', 'Friday'])
data['State'] = 'ST' + pd.Series((np.arange(N) % 19).astype(str))
print(data)
Monday Wednesday Friday State
0 0.417022 0.720324 0.000114 ST0
1 0.302333 0.146756 0.092339 ST1
2 0.186260 0.345561 0.396767 ST2
3 0.538817 0.419195 0.685220 ST3
4 0.204452 0.878117 0.027388 ST4
I want to convert this dataframe to
0  ST0  Monday       0.417022
        Wednesday    0.7203245
        Friday       0.0001143748
        State        ST0
1  ST1  Monday       0.3023326
        Wednesday    0.1467559
        Friday       0.09233859
        State        ST1
2  ST2  Monday       0.1862602
        Wednesday    0.3455607
        Friday       0.3967675
        State        ST2
3  ST3  Monday       0.5388167
        Wednesday    0.4191945
        Friday       0.6852195
        State        ST3
4  ST4  Monday       0.2044522
        Wednesday    0.8781174
        Friday       0.02738759
        State        ST4
If I use data.stack() alone, it gives something like:
0 Monday 0.417022
Wednesday 0.7203245
Friday 0.0001143748
State ST0
1 Monday 0.3023326
Wednesday 0.1467559
Friday 0.09233859
State ST1
2 Monday 0.1862602
Wednesday 0.3455607
Friday 0.3967675
State ST2
3 Monday 0.5388167
Wednesday 0.4191945
Friday 0.6852195
State ST3
4 Monday 0.2044522
Wednesday 0.8781174
Friday 0.02738759
State ST4
How can I make the State column the first level and the day columns the second level of the MultiIndex?

You just need to move the State column into the index before stacking:
data.set_index('State', append=True).stack()
Out[4]:
State
0 ST0 Monday 0.417022
Wednesday 0.720324
Friday 0.000114
1 ST1 Monday 0.302333
Wednesday 0.146756
Friday 0.092339
2 ST2 Monday 0.186260
Wednesday 0.345561
Friday 0.396767
3 ST3 Monday 0.538817
Wednesday 0.419195
Friday 0.685220
4 ST4 Monday 0.204452
Wednesday 0.878117
Friday 0.027388
dtype: float64
Note that this doesn't exactly match the output you posted: I haven't included State alongside the days, as I think it's more sensible this way. If you really want it exactly like your original output, use: data.set_index('State', append=True, drop=False).stack()
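A runnable end-to-end version of this approach (using np directly rather than the long-removed pd.np alias):

```python
import numpy as np
import pandas as pd

np.random.seed(1)
N = 5
data = pd.DataFrame(np.random.rand(N, 3), columns=['Monday', 'Wednesday', 'Friday'])
data['State'] = 'ST' + pd.Series((np.arange(N) % 19).astype(str))

# Move State into the index, then stack the day columns into a new inner level
stacked = data.set_index('State', append=True).stack()
print(stacked)
```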

You could use melt with the State column as the id variable:
In [24]: pd.melt(data, id_vars=['State'])
Out[24]:
State variable value
0 ST0 Monday 0.417022
1 ST1 Monday 0.302333
2 ST2 Monday 0.186260
3 ST3 Monday 0.538817
4 ST4 Monday 0.204452
5 ST0 Wednesday 0.720324
6 ST1 Wednesday 0.146756
7 ST2 Wednesday 0.345561
8 ST3 Wednesday 0.419195
9 ST4 Wednesday 0.878117
10 ST0 Friday 0.000114
11 ST1 Friday 0.092339
12 ST2 Friday 0.396767
13 ST3 Friday 0.685220
14 ST4 Friday 0.027388
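If you want the melted result in the same MultiIndex shape as the stack-based answer, you can sort and re-index afterwards (a sketch; variable and value are melt's default column names):

```python
import numpy as np
import pandas as pd

np.random.seed(1)
data = pd.DataFrame(np.random.rand(5, 3), columns=['Monday', 'Wednesday', 'Friday'])
data['State'] = 'ST' + pd.Series((np.arange(5) % 19).astype(str))

# Melt, then rebuild the (row, State, day) MultiIndex
melted = (data.reset_index()
              .melt(id_vars=['index', 'State'])
              .sort_values('index', kind='stable')
              .set_index(['index', 'State', 'variable'])['value'])
print(melted)
```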

Related

How to split a dataframe by week on a particular starting weekday (e.g, Thursday)?

I'm using Python, and I have a DataFrame in which all dates and their weekdays are listed.
I want to divide it into weeks running Thursday to Thursday, in this format:
Date Weekday
0 2021-01-07 Thursday
1 2021-01-08 Friday
2 2021-01-09 Saturday
3 2021-01-10 Sunday
4 2021-01-11 Monday
5 2021-01-12 Tuesday
6 2021-01-13 Wednesday
7 2021-01-14 Thursday,
Date Weekday
0 2021-01-14 Thursday
1 2021-01-15 Friday
2 2021-01-16 Saturday
3 2021-01-17 Sunday
4 2021-01-18 Monday
5 2021-01-19 Tuesday
6 2021-01-20 Wednesday
7 2021-01-21 Thursday,
Date Weekday
0 2021-01-21 Thursday
1 2021-01-22 Friday
2 2021-01-23 Saturday
3 2021-01-24 Sunday
4 2021-01-25 Monday
5 2021-01-26 Tuesday
6 2021-01-27 Wednesday
7 2021-01-28 Thursday,
Date Weekday
0 2021-01-28 Thursday
1 2021-01-29 Friday
2 2021-01-30 Saturday.
In this Format but i don't know how can i divide this dataframe.
You can use pandas.to_datetime if the Date column is not yet a datetime type, then group by the week number (dt.week is deprecated in recent pandas; dt.isocalendar().week is the replacement):
dfs = [g for _, g in df.groupby(pd.to_datetime(df['Date']).dt.isocalendar().week)]
Alternatively, if you have several years, use dt.to_period:
dfs = [g for _,g in df.groupby(pd.to_datetime(df['Date']).dt.to_period('W'))]
output:
[ Date Weekday
0 2021-01-07 Thursday
1 2021-01-08 Friday
2 2021-01-09 Saturday
3 2021-01-10 Sunday,
Date Weekday
4 2021-01-11 Monday
5 2021-01-12 Tuesday
6 2021-01-13 Wednesday
7 2021-01-14 Thursday
8 2021-01-14 Thursday
9 2021-01-15 Friday
10 2021-01-16 Saturday
11 2021-01-17 Sunday,
Date Weekday
12 2021-01-18 Monday
13 2021-01-19 Tuesday
14 2021-01-20 Wednesday
15 2021-01-21 Thursday
16 2021-01-21 Thursday
17 2021-01-22 Friday
18 2021-01-23 Saturday
19 2021-01-24 Sunday,
Date Weekday
20 2021-01-25 Monday
21 2021-01-26 Tuesday
22 2021-01-27 Wednesday
23 2021-01-28 Thursday
24 2021-01-28 Thursday
25 2021-01-29 Friday
26 2021-01-30 Saturday]
variants
As dictionary:
{k:g for k,g in df.groupby(pd.to_datetime(df['Date']).dt.to_period('W'))}
reset_index of subgroups:
[g.reset_index() for _,g in df.groupby(pd.to_datetime(df['Date']).dt.to_period('W'))]
weeks ending on Wednesday/starting on Thursday with anchor offsets:
[g.reset_index() for _,g in df.groupby(pd.to_datetime(df['Date']).dt.to_period('W-WED'))]
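Putting the anchored-offset variant together for the Thursday-to-Thursday case: 'W-WED' periods end on Wednesday, so each group starts on Thursday. A sketch with a small generated frame standing in for the OP's data:

```python
import pandas as pd

df = pd.DataFrame({'Date': pd.date_range('2021-01-07', '2021-01-30')})
df['Weekday'] = df['Date'].dt.day_name()

# 'W-WED' weekly periods end on Wednesday, so each group runs Thursday..Wednesday
weeks = [g.reset_index(drop=True)
         for _, g in df.groupby(df['Date'].dt.to_period('W-WED'))]
print(weeks[0])
```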

Get percentage of each row grouped by a value

I have the following df:
df3 = pd.DataFrame(np.array([['Iza', 'Tuesday'],['Martin', 'Friday'],['John', 'Monday'],['Iza', 'Tuesday'],['Iza', 'Tuesday'],['Iza', 'Wednesday'],['Sara', 'Friday'], ['Sara', 'Friday'], ['Sara', 'Sunday'],['Silvia', 'Monday'],['Silvia', 'Wednesday'],['Paul', 'Monday'],['Paul', 'Tuesday'],['Paul', 'Wednesday']]),
columns=['Name', 'Day'])
df3:
Name Day
0 Iza Tuesday
1 Martin Friday
2 John Monday
3 Iza Tuesday
4 Iza Tuesday
5 Iza Wednesday
6 Sara Friday
7 Sara Friday
8 Sara Sunday
9 Silvia Monday
10 Silvia Wednesday
11 Paul Monday
12 Paul Tuesday
13 Paul Wednesday
I got the count of days for each user:
oo = df3.groupby(['Name','Day'])['Day'].size().reset_index(name='counts')
result:
Name Day counts
0 Iza Tuesday 3
1 Iza Wednesday 1
2 John Monday 1
3 Martin Friday 1
4 Paul Monday 1
5 Paul Tuesday 1
6 Paul Wednesday 1
7 Sara Friday 2
8 Sara Sunday 1
9 Silvia Monday 1
10 Silvia Wednesday 1
Then I dropped users with only one day record:
uniq_us = oo[oo.duplicated(['Name'], keep=False)]
result:
Name Day counts
0 Iza Tuesday 3
1 Iza Wednesday 1
4 Paul Monday 1
5 Paul Tuesday 1
6 Paul Wednesday 1
7 Sara Friday 2
8 Sara Sunday 1
9 Silvia Monday 1
10 Silvia Wednesday 1
Now I want to get the percentage of counts within each Name group:
uniq_us.groupby(['Name','Day'])['counts'].apply(lambda x: x.value_counts(normalize=True)) * 100
I got:
Name Day
Iza Tuesday 3 100.0
Wednesday 1 100.0
Paul Monday 1 100.0
Tuesday 1 100.0
Wednesday 1 100.0
Sara Friday 2 100.0
Sunday 1 100.0
Silvia Monday 1 100.0
Wednesday 1 100.0
Name: counts, dtype: float64
I don't know how to calculate it per Name group.
Desired output:
Name Day
Iza Tuesday 3 75.0
Wednesday 1 25.0
Paul Monday 1 33.33
Tuesday 1 33.33
Wednesday 1 33.33
Sara Friday 2 66.66
Sunday 1 33.34
Silvia Monday 1 50.0
Wednesday 1 50.0
Name: counts, dtype: float64
You can normalize the counts with their sum via transform:
uniq_us["pcnt"] = uniq_us.groupby("Name").counts.transform(lambda x: x / x.sum())
to get
>>> uniq_us
Name Day counts pcnt
0 Iza Tuesday 3 0.750000
1 Iza Wednesday 1 0.250000
4 Paul Monday 1 0.333333
5 Paul Tuesday 1 0.333333
6 Paul Wednesday 1 0.333333
7 Sara Friday 2 0.666667
8 Sara Sunday 1 0.333333
9 Silvia Monday 1 0.500000
10 Silvia Wednesday 1 0.500000
You can put the 100 * and .round(2) inside the lambda, and set Name and Day as the index, to match the desired output:
...transform(lambda x: (100 * x / x.sum()).round(2))
uniq_us = uniq_us.set_index(["Name", "Day"])
to get
counts pcnt
Name Day
Iza Tuesday 3 75.00
Wednesday 1 25.00
Paul Monday 1 33.33
Tuesday 1 33.33
Wednesday 1 33.33
Sara Friday 2 66.67
Sunday 1 33.33
Silvia Monday 1 50.00
Wednesday 1 50.00
You're almost there. Try:
>>> uniq_us.groupby(["Name", "Day"]).sum()/uniq_us.groupby("Name").sum()
counts
Name Day
Iza Tuesday 0.750000
Wednesday 0.250000
Paul Monday 0.333333
Tuesday 0.333333
Wednesday 0.333333
Sara Friday 0.666667
Sunday 0.333333
Silvia Monday 0.500000
Wednesday 0.500000
Another option is to normalize the counts at an early stage:
(df3.groupby('Name')
.Day
.value_counts(normalize=True)
.mul(100)
.rename('Counts')
.reset_index()
.pipe(lambda x: x[x.duplicated(['Name'], keep=False)]))
# Name Day Counts
#0 Iza Tuesday 75.000000
#1 Iza Wednesday 25.000000
#4 Paul Monday 33.333333
#5 Paul Tuesday 33.333333
#6 Paul Wednesday 33.333333
#7 Sara Friday 66.666667
#8 Sara Sunday 33.333333
#9 Silvia Monday 50.000000
#10 Silvia Wednesday 50.000000
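The same percentages can also be computed without a lambda by broadcasting each name's total with transform('sum'); a self-contained sketch rebuilding uniq_us from the question's data:

```python
import pandas as pd

df3 = pd.DataFrame({
    'Name': ['Iza', 'Martin', 'John', 'Iza', 'Iza', 'Iza', 'Sara',
             'Sara', 'Sara', 'Silvia', 'Silvia', 'Paul', 'Paul', 'Paul'],
    'Day': ['Tuesday', 'Friday', 'Monday', 'Tuesday', 'Tuesday', 'Wednesday',
            'Friday', 'Friday', 'Sunday', 'Monday', 'Wednesday', 'Monday',
            'Tuesday', 'Wednesday']})

oo = df3.groupby(['Name', 'Day']).size().reset_index(name='counts')
uniq_us = oo[oo.duplicated(['Name'], keep=False)].copy()

# Broadcast each name's total count back to its rows, then divide
uniq_us['pcnt'] = (100 * uniq_us['counts']
                   / uniq_us.groupby('Name')['counts'].transform('sum')).round(2)
print(uniq_us)
```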

Date Offset in pandas data range

I have the following formula, which gets me the end-of-month (EOM) date every 3 months starting February 1990.
dates = pd.date_range(start="1990-02-01", end="2029-09-30", freq="3M")
I am looking for a condensed way to get the same table, but with each date offset by x business days.
That is, if x = 2, each result should be 2 business days before the EOM date calculated every 3 months starting February 1990.
Thanks for the help.
from pandas.tseries.offsets import BDay
x = 2
dates = pd.date_range(start="1990-02-01", end="2029-09-30", freq="3M") - BDay(x)
>>> dates
DatetimeIndex(['1990-02-26', '1990-05-29', '1990-08-29', '1990-11-28',
'1991-02-26', '1991-05-29', '1991-08-29', '1991-11-28',
'1992-02-27', '1992-05-28',
...
'2027-05-27', '2027-08-27', '2027-11-26', '2028-02-25',
'2028-05-29', '2028-08-29', '2028-11-28', '2029-02-26',
'2029-05-29', '2029-08-29'],
dtype='datetime64[ns]', length=159, freq=None)
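A quick sanity check of the shift (assuming pandas' default Monday-Friday business days): 1990-02-28 was a Wednesday, and subtracting BDay(2) lands on Monday 1990-02-26, matching the first entry above.

```python
import pandas as pd
from pandas.tseries.offsets import BDay

# First three month-end dates from the 3M range, each shifted back 2 business days
eom = pd.to_datetime(['1990-02-28', '1990-05-31', '1990-08-31'])
shifted = eom - BDay(2)
print(shifted)
```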
Example
x = 2
dti1 = pd.date_range(start="1990-02-01", end="2029-09-30", freq="3M")
dti2 = pd.date_range(start="1990-02-01", end="2029-09-30", freq="3M") - BDay(x)
df = pd.DataFrame({"dti1": dti1.day_name(), "dti2": dti2.day_name()})
>>> df.head(20)
dti1 dti2
0 Wednesday Monday
1 Thursday Tuesday
2 Friday Wednesday
3 Friday Wednesday
4 Thursday Tuesday
5 Friday Wednesday
6 Saturday Thursday
7 Saturday Thursday
8 Saturday Thursday
9 Sunday Thursday
10 Monday Thursday
11 Monday Thursday
12 Sunday Thursday
13 Monday Thursday
14 Tuesday Friday
15 Tuesday Friday
16 Monday Thursday
17 Tuesday Friday
18 Wednesday Monday
19 Wednesday Monday

Change Saturdays and Sundays to Fridays

My DataFrame:
start_trade week_day
0 2021-01-16 09:30:00 Saturday
1 2021-01-19 14:30:00 Tuesday
2 2021-01-25 22:00:00 Monday
3 2021-01-29 12:15:00 Friday
4 2021-01-31 12:35:00 Sunday
There are no trades on the exchange on Saturday and Sunday. Therefore, if my trading signal falls on the weekend, I want to open a trade on Friday 23:50.
Expected output:
start_trade week_day
0 2021-01-15 23:50:00 Friday
1 2021-01-19 14:30:00 Tuesday
2 2021-01-25 22:00:00 Monday
3 2021-01-29 12:15:00 Friday
4 2021-01-29 23:50:00 Friday
How to do it?
You can do it by playing with to_timedelta to change the date to the Friday of that week, then setting the time with Timedelta. Apply this only to the rows selected by the mask:
# mask for weekend dates (Saturday=5, Sunday=6)
mask = df['start_trade'].dt.weekday.isin([5,6])
df.loc[mask, 'start_trade'] = (df['start_trade'].dt.normalize() # to get midnight
- pd.to_timedelta(df['start_trade'].dt.weekday-4, unit='D') # to get the friday date
+ pd.Timedelta(hours=23, minutes=50)) # set 23:50 for time
df.loc[mask, 'week_day'] = 'Friday'
print(df)
start_trade week_day
0 2021-01-15 23:50:00 Friday
1 2021-01-19 14:30:00 Tuesday
2 2021-01-25 22:00:00 Monday
3 2021-01-29 12:15:00 Friday
4 2021-01-29 23:50:00 Friday
Try:
weekend = df['week_day'].isin(['Saturday', 'Sunday'])
df.loc[weekend, 'week_day'] = 'Friday'
Or np.where along with str.contains and the | operator:
df['week_day'] = np.where(df['week_day'].str.contains(r'Saturday|Sunday'), 'Friday', df['week_day'])
Note that this only relabels the week_day column; the start_trade timestamp still needs to be shifted separately.
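For completeness, the timestamp shift and the label fix can be combined into one self-contained sketch (column names as in the question):

```python
import pandas as pd

df = pd.DataFrame({'start_trade': pd.to_datetime(
    ['2021-01-16 09:30:00', '2021-01-19 14:30:00',
     '2021-01-25 22:00:00', '2021-01-29 12:15:00',
     '2021-01-31 12:35:00'])})
df['week_day'] = df['start_trade'].dt.day_name()

# Weekend rows: move to the preceding Friday at 23:50
mask = df['start_trade'].dt.weekday >= 5  # Saturday=5, Sunday=6
df.loc[mask, 'start_trade'] = (
    df['start_trade'].dt.normalize()                                 # midnight
    - pd.to_timedelta(df['start_trade'].dt.weekday - 4, unit='D')    # back to Friday
    + pd.Timedelta(hours=23, minutes=50))                            # 23:50
df.loc[mask, 'week_day'] = 'Friday'
print(df)
```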

How to set the columns in pandas

Here is my dataframe:
Dec-18 Jan-19 Feb-19 Mar-19 Apr-19 May-19
Saturday 2540.0 2441.0 3832.0 4093.0 1455.0 2552.0
Sunday 1313.0 1891.0 2968.0 2260.0 1454.0 1798.0
Monday 1360.0 1558.0 2967.0 2156.0 1564.0 1752.0
Tuesday 1089.0 2105.0 2476.0 1577.0 1744.0 1457.0
Wednesday 1329.0 1658.0 2073.0 2403.0 1231.0 874.0
Thursday 798.0 1195.0 2183.0 1287.0 1460.0 1269.0
I have tried some pandas ops but I am not able to do that.
This is what I want to do:
items
Saturday 2540.0
Sunday 1313.0
Monday 1360.0
Tuesday 1089.0
Wednesday 1329.0
Thursday 798.0
Saturday 2441.0
Sunday 1891.0
Monday 1558.0
Tuesday 2105.0
Wednesday 1658.0
Thursday 1195.0 ............ and so on
I want to stack those columns downwards into rows; how can I do that?
df.reset_index().melt(id_vars='index').drop(columns='variable')
Output:
index value
0 Saturday 2540.0
1 Sunday 1313.0
2 Monday 1360.0
3 Tuesday 1089.0
4 Wednesday 1329.0
5 Thursday 798.0
6 Saturday 2441.0
7 Sunday 1891.0
8 Monday 1558.0
9 Tuesday 2105.0
10 Wednesday 1658.0
11 Thursday 1195.0
12 Saturday 3832.0
13 Sunday 2968.0
14 Monday 2967.0
15 Tuesday 2476.0
16 Wednesday 2073.0
17 Thursday 2183.0
18 Saturday 4093.0
19 Sunday 2260.0
20 Monday 2156.0
21 Tuesday 1577.0
22 Wednesday 2403.0
23 Thursday 1287.0
24 Saturday 1455.0
25 Sunday 1454.0
26 Monday 1564.0
27 Tuesday 1744.0
28 Wednesday 1231.0
29 Thursday 1460.0
30 Saturday 2552.0
31 Sunday 1798.0
32 Monday 1752.0
33 Tuesday 1457.0
34 Wednesday 874.0
35 Thursday 1269.0
Note: I just noticed a comment suggesting the same thing; I will delete my post if requested :)
Create it with numpy by reshaping the data.
import pandas as pd
import numpy as np
pd.DataFrame(df.to_numpy().flatten('F'),
             index=np.tile(df.index, df.shape[1]),
             columns=['items'])
Output:
items
Saturday 2540.0
Sunday 1313.0
Monday 1360.0
Tuesday 1089.0
Wednesday 1329.0
Thursday 798.0
Saturday 2441.0
...
Sunday 1798.0
Monday 1752.0
Tuesday 1457.0
Wednesday 874.0
Thursday 1269.0
You can do:
df = df.stack().sort_index(level=1).reset_index(level=1, drop=True).to_frame('items')
(Note that sort_index sorts the month labels lexicographically, so with labels like Dec-18 and Jan-19 the original column order is not preserved.)
Interestingly, this method was overlooked even though a quick (single-run) timing suggests it is the fastest:
import time
start = time.time()
df.stack().sort_index(level=1).reset_index(level = 1, drop=True).to_frame('items')
end = time.time()
print("time taken {}".format(end-start))
yields: time taken 0.006181955337524414
while this:
start = time.time()
df.reset_index().melt(id_vars='days').drop(columns='variable')  # assumes the index is named 'days'
end = time.time()
print("time taken {}".format(end-start))
yields: time taken 0.010072708129882812
And my output format matches the OP's requested layout.
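Single-run wall-clock timings like these are noisy; timeit averages many runs and gives steadier numbers. A sketch, with a random 6x6 frame standing in for the OP's data:

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(6, 6),
                  index=['Saturday', 'Sunday', 'Monday',
                         'Tuesday', 'Wednesday', 'Thursday'],
                  columns=['Dec-18', 'Jan-19', 'Feb-19',
                           'Mar-19', 'Apr-19', 'May-19'])

# The two approaches being compared
stack_way = lambda: (df.stack().sort_index(level=1)
                       .reset_index(level=1, drop=True).to_frame('items'))
melt_way = lambda: (df.reset_index().melt(id_vars='index')
                      .drop(columns='variable'))

t_stack = timeit.timeit(stack_way, number=200)
t_melt = timeit.timeit(melt_way, number=200)
print(f'stack: {t_stack:.4f}s  melt: {t_melt:.4f}s')
```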
