I have a pandas df that I need to sort based on a fixed order given in a list. The problem is that the sort I'm attempting does not move the data rows into the order given by the list. My list and DataFrame (df) look like this:
months = ['5','6','7','8','9','10','11','12','1','2']
df =
year 1992 1993 1994 1995
month
1 -0.343107 -0.211959 0.437974 -1.219363
2 -0.383353 0.888650 1.054926 0.714846
5 0.057198 1.246682 0.042684 0.275701
6 -0.100018 -0.801554 0.001111 0.382633
7 -0.283815 0.204448 0.350705 0.130652
8 0.042195 -0.433849 -1.481228 -0.236004
9 1.059776 0.875214 0.304638 0.127819
10 -0.328911 -0.256656 1.081157 1.057449
11 -0.488213 -0.957050 -0.813885 1.403822
12 0.973031 -0.246714 0.600157 0.579038
The closest that I have gotten is this -
newdf = pd.DataFrame(df.values, index=list(months))
...but it does not move the rows. This command only adds the months in the index column w/out moving the data.
0 1 2 3
5 -0.343107 -0.211959 0.437974 -1.219363
6 -0.383353 0.888650 1.054926 0.714846
7 0.057198 1.246682 0.042684 0.275701
8 -0.100018 -0.801554 0.001111 0.382633
9 -0.283815 0.204448 0.350705 0.130652
10 0.042195 -0.433849 -1.481228 -0.236004
11 1.059776 0.875214 0.304638 0.127819
12 -0.328911 -0.256656 1.081157 1.057449
1 -0.488213 -0.957050 -0.813885 1.403822
2 0.973031 -0.246714 0.600157 0.579038
I need the result to look like -
year 1992 1993 1994 1995
month
5 0.057198 1.246682 0.042684 0.275701
6 -0.100018 -0.801554 0.001111 0.382633
7 -0.283815 0.204448 0.350705 0.130652
8 0.042195 -0.433849 -1.481228 -0.236004
9 1.059776 0.875214 0.304638 0.127819
10 -0.328911 -0.256656 1.081157 1.057449
11 -0.488213 -0.957050 -0.813885 1.403822
12 0.973031 -0.246714 0.600157 0.579038
1 -0.343107 -0.211959 0.437974 -1.219363
2 -0.383353 0.888650 1.054926 0.714846
Assuming df.index is dtype('int64'), first convert months to integers. Then use loc:
months = [*map(int, months)]
out = df.loc[months]
If df.index is dtype('O'), you can use loc right away, i.e. you don't need the first line.
Output:
year 1992 1993 1994 1995
month
5 0.057198 1.246682 0.042684 0.275701
6 -0.100018 -0.801554 0.001111 0.382633
7 -0.283815 0.204448 0.350705 0.130652
8 0.042195 -0.433849 -1.481228 -0.236004
9 1.059776 0.875214 0.304638 0.127819
10 -0.328911 -0.256656 1.081157 1.057449
11 -0.488213 -0.957050 -0.813885 1.403822
12 0.973031 -0.246714 0.600157 0.579038
1 -0.343107 -0.211959 0.437974 -1.219363
2 -0.383353 0.888650 1.054926 0.714846
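A minimal sketch of the same idea that does not require knowing the index dtype in advance. It casts the list to whatever dtype the index already has and then selects with loc. The small frame and the shortened months list below are made-up stand-ins for the data in the question, and the variable names order, out and out_reindexed are only illustrative.

```python
import pandas as pd

# Cut-down stand-in for the frame in the question (integer month index).
df = pd.DataFrame(
    {1992: [-0.343107, -0.383353, 0.057198, -0.100018]},
    index=pd.Index([1, 2, 5, 6], name='month'),
)
months = ['5', '6', '1', '2']  # desired order, given as strings

# Cast the desired order to the dtype of the existing index so the labels match.
order = pd.Index(months).astype(df.index.dtype)

# loc selects rows in exactly the order of the labels (KeyError if one is missing).
out = df.loc[order]

# reindex does the same but inserts NaN rows for labels that are not present.
out_reindexed = df.reindex(order)

print(out)
```

If some months in the list might be absent from the frame, reindex is probably the safer of the two, since loc raises on missing labels.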
I have a pandas dataframe called ranks with my clusters and their key metrics. I rank them using rank(); however, there are two specific clusters which I want ranked differently from the others.
ranks = pd.DataFrame(data={'Cluster': ['0', '1', '2',
'3', '4', '5','6', '7', '8', '9'],
'No. Customers': [145118,
2,
1236,
219847,
9837,
64865,
3855,
219549,
34171,
3924120],
'Ave. Recency': [39.0197,
47.0,
15.9716,
41.9736,
23.9330,
24.8281,
26.5647,
17.7493,
23.5205,
24.7933],
'Ave. Frequency': [1.7264,
19.0,
24.9101,
3.0682,
3.2735,
1.8599,
3.9304,
3.3356,
9.1703,
1.1684],
'Ave. Monetary': [14971.85,
237270.00,
126992.79,
17701.64,
172642.35,
13159.21,
54333.56,
17570.67,
42136.68,
4754.76]})
ranks['Ave. Spend'] = ranks['Ave. Monetary']/ranks['Ave. Frequency']
Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
0 0 145118 39.0197 1.7264 14,971.85 8,672.07
1 1 2 47.0 19.0 237,270.00 12,487.89
2 2 1236 15.9716 24.9101 126,992.79 5,098.02
3 3 219847 41.9736 3.0682 17,701.64 5,769.23
4 4 9837 23.9330 3.2735 172,642.35 52,738.42
5 5 64865 24.8281 1.8599 13,159.21 7,075.19
6 6 3855 26.5647 3.9304 54,333.56 13,823.64
7 7 219549 17.7493 3.3356 17,570.67 5,267.52
8 8 34171 23.5205 9.1703 42,136.68 4,594.89
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21
I then apply the rank() method like this:
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
ranks['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
ranks['overall_rank'] = ranks['overall'].rank(method='first')
Which gives me this:
Cluster No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank
0 0 145118 39.0197 1.7264 14,971.85 8,672.07 8 9 8 4 29 9
1 1 2 47.0 19.0 237,270.00 12,487.89 10 2 1 3 16 3
2 2 1236 15.9716 24.9101 126,992.79 5,098.02 1 1 3 8 13 1
3 3 219847 41.9736 3.0682 17,701.64 5,769.23 9 7 6 6 28 7
4 4 9837 23.9330 3.2735 172,642.35 52,738.42 4 6 2 1 13 2
5 5 64865 24.8281 1.8599 13,159.21 7,075.19 6 8 9 5 28 8
6 6 3855 26.5647 3.9304 54,333.56 13,823.64 7 4 4 2 17 4
7 7 219549 17.7493 3.3356 17,570.67 5,267.52 2 5 7 7 21 6
8 8 34171 23.5205 9.1703 42,136.68 4,594.89 3 3 5 9 20 5
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21 5 10 10 10 35 10
This does what it's supposed to do; however, the cluster with the highest Ave. Spend needs to be ranked 1 at all times and the cluster with the highest Ave. Recency needs to be ranked last at all times.
So I modified the code above to look like this:
if(ranks['s_rank'].min() == 1):
ranks['overall_rank_2'] = 1
elif(ranks['r_rank'].max() == len(ranks)):
ranks['overall_rank_2'] = len(ranks)
else:
ranks_2 = ranks.drop(ranks.index[[ranks[ranks['s_rank'] == ranks['s_rank'].min()].index[0],ranks[ranks['r_rank'] == ranks['r_rank'].max()].index[0]]])
ranks_2['r_rank'] = ranks_2['Ave. Recency'].rank()
ranks_2['f_rank'] = ranks_2['Ave. Frequency'].rank(ascending=False)
ranks_2['m_rank'] = ranks_2['Ave. Monetary'].rank(ascending=False)
ranks_2['s_rank'] = ranks_2['Ave. Spend'].rank(ascending=False)
ranks_2['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
ranks['overall_rank_2'] = ranks_2['overall'].rank(method='first')
Then I get this
Cluster No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank|overall_rank_2
0 0 145118 39.0197 1.7264 14,971.85 8,672.07 8 9 8 4 29 9 1
1 1 2 47.0 19.0 237,270.00 12,487.89 10 2 1 3 16 3 1
2 2 1236 15.9716 24.9101 126,992.79 5,098.02 1 1 3 8 13 1 1
3 3 219847 41.9736 3.0682 17,701.64 5,769.23 9 7 6 6 28 7 1
4 4 9837 23.9330 3.2735 172,642.35 52,738.42 4 6 2 1 13 2 1
5 5 64865 24.8281 1.8599 13,159.21 7,075.19 6 8 9 5 28 8 1
6 6 3855 26.5647 3.9304 54,333.56 13,823.64 7 4 4 2 17 4 1
7 7 219549 17.7493 3.3356 17,570.67 5,267.52 2 5 7 7 21 6 1
8 8 34171 23.5205 9.1703 42,136.68 4,594.89 3 3 5 9 20 5 1
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21 5 10 10 10 35 10 1
Please help me modify the above if statement or perhaps recommend a different approach altogether. This of course needs to be as dynamic as possible.
So you want a custom ranking on your dataframe, where the cluster(/row) with the highest Ave. Spend is always ranked 1, and the one with the highest Ave. Recency always ranks last.
The solution is five lines. Notes:
You had the right idea with DataFrame.drop(), just use idxmax() to get the index of both of the rows that will need special treatment, and store it, so you don't need a huge unwieldy logical filter expression in your drop.
No need to make so many temporary columns, or the temporary copy ranks_2 = ranks.drop(...); just pass the result of the drop() into a rank() ...
... via a .sum(axis=1) on your desired columns, no need to define a lambda, or save its output in the temp column 'overall'.
...then we just feed those sum-of-ranks into rank(), which will give us values from 1..8, so we add 1 to offset the results of rank() to be 2..9. (You can generalize this).
And we manually set the 'overall_rank' for the Ave. Spend, Ave. Recency rows.
(Yes you could also implement all this as a custom function whose input is the four Ave. columns or else the four *_rank columns.)
Code: (see at bottom for boilerplate to read in your dataframe, next time please make your example MCVE, to help us help you)
# Compute raw ranks like you do
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
# Find the indices of both the highest AveSpend and AveRecency
ismax = ranks['Ave. Spend'].idxmax()
irmax = ranks['Ave. Recency'].idxmax()
# Get the overall ranking for every row other than these... add 1 to offset for excluding the max-AveSpend row:
ranks['overall_rank'] = 1 + ranks.drop(index = [ismax,irmax]) [['r_rank','f_rank','m_rank','s_rank']].sum(axis=1).rank(method='first')
# Manually set the rank of the two special rows, reusing the indices found above
ranks.loc[ismax, 'overall_rank'] = 1
ranks.loc[irmax, 'overall_rank'] = len(ranks)
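The steps above could also be wrapped in a function, as the notes suggest. This is only a sketch: the function name custom_overall_rank is hypothetical, it assumes the max-spend row and the max-recency row are two different rows, and it ranks the remaining rows by the sum of the raw ranks computed over all rows, as in the code above.

```python
import pandas as pd

def custom_overall_rank(ranks: pd.DataFrame) -> pd.Series:
    """Rank rows by summed metric ranks, pinning two rows to first and last."""
    top = ranks['Ave. Spend'].idxmax()       # always ranked 1
    bottom = ranks['Ave. Recency'].idxmax()  # always ranked last

    # Sum of the four raw ranks, computed over all rows.
    rank_sum = (
        ranks['Ave. Recency'].rank()
        + ranks['Ave. Frequency'].rank(ascending=False)
        + ranks['Ave. Monetary'].rank(ascending=False)
        + ranks['Ave. Spend'].rank(ascending=False)
    )

    # Rank everything except the two pinned rows, shifted by 1 to leave rank 1 free.
    middle = rank_sum.drop(index=[top, bottom]).rank(method='first') + 1

    overall = middle.reindex(ranks.index)
    overall.loc[top] = 1
    overall.loc[bottom] = len(ranks)
    return overall.astype(int)

# Data from the question.
ranks = pd.DataFrame({
    'Ave. Recency': [39.0197, 47.0, 15.9716, 41.9736, 23.9330,
                     24.8281, 26.5647, 17.7493, 23.5205, 24.7933],
    'Ave. Frequency': [1.7264, 19.0, 24.9101, 3.0682, 3.2735,
                       1.8599, 3.9304, 3.3356, 9.1703, 1.1684],
    'Ave. Monetary': [14971.85, 237270.00, 126992.79, 17701.64, 172642.35,
                      13159.21, 54333.56, 17570.67, 42136.68, 4754.76],
})
ranks['Ave. Spend'] = ranks['Ave. Monetary'] / ranks['Ave. Frequency']
ranks['overall_rank'] = custom_overall_rank(ranks)
print(ranks['overall_rank'].tolist())
```

Because the pinned rows are located with idxmax() on every call, the result follows the data if the clusters change.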
And here's the boilerplate to ingest your data:
import pandas as pd
from io import StringIO
# """Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
dat = """
0 145118 39.0197 1.7264 14,971.85 8,672.07
1 2 47.0 19.0 237,270.00 12,487.89
2 1236 15.9716 24.9101 126,992.79 5,098.02
3 219847 41.9736 3.0682 17,701.64 5,769.23
4 9837 23.9330 3.2735 172,642.35 52,738.42
5 64865 24.8281 1.8599 13,159.21 7,075.19
6 3855 26.5647 3.9304 54,333.56 13,823.64
7 219549 17.7493 3.3356 17,570.67 5,267.52
8 34171 23.5205 9.1703 42,136.68 4,594.89
9 3924120 24.7933 1.1684 4,754.76 4,069.21 """
# Remove the comma thousands-separator, to prevent your floats being read in as string
dat = dat.replace(',', '')
# Use a raw string for the regex separator to avoid an invalid-escape warning
ranks = pd.read_csv(StringIO(dat), sep=r'\s+', names=
"Cluster|No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend".split('|'))
I use python pandas to calculate the following formula
(https://i.stack.imgur.com/XIKBz.png)
I do it in python like this:
EURUSD['SMA2'] = EURUSD['Close'].rolling(2).mean()
EURUSD['TMA2'] = (EURUSD['Close'] + EURUSD['SMA2']) / 2
The problem is that the code gets very long when I calculate TMA 100, so I need a "for loop" that makes it easy to change the TMA period.
Thanks in advance
Edited :
I found this code, but it raises an error:
values = []
for i in range(1,201):
    values.append(eurusd['Close']).rolling(window=i).mean()
values.mean()
TMA is an average of averages.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5))
print(df)
# df['mean0']=df.mean(0)
df['mean1']=df.mean(1)
print(df)
df['TMA'] = df['mean1'].rolling(window=10,center=False).mean()
print(df)
Or you can easily print it.
print(df["mean1"].mean())
Here is how it looks:
0 1 2 3 4
0 0.643560 0.412046 0.072525 0.618968 0.080146
1 0.018226 0.222212 0.077592 0.125714 0.595707
2 0.652139 0.907341 0.581802 0.021503 0.849562
3 0.129509 0.315618 0.711265 0.812318 0.757575
4 0.881567 0.455848 0.470282 0.367477 0.326812
5 0.102455 0.156075 0.272582 0.719158 0.266293
6 0.412049 0.527936 0.054381 0.587994 0.442144
7 0.063904 0.635857 0.244050 0.002459 0.423960
8 0.446264 0.116646 0.990394 0.678823 0.027085
9 0.951547 0.947705 0.080846 0.848772 0.699036
0 1 2 3 4 mean1
0 0.643560 0.412046 0.072525 0.618968 0.080146 0.365449
1 0.018226 0.222212 0.077592 0.125714 0.595707 0.207890
2 0.652139 0.907341 0.581802 0.021503 0.849562 0.602470
3 0.129509 0.315618 0.711265 0.812318 0.757575 0.545257
4 0.881567 0.455848 0.470282 0.367477 0.326812 0.500397
5 0.102455 0.156075 0.272582 0.719158 0.266293 0.303313
6 0.412049 0.527936 0.054381 0.587994 0.442144 0.404901
7 0.063904 0.635857 0.244050 0.002459 0.423960 0.274046
8 0.446264 0.116646 0.990394 0.678823 0.027085 0.451842
9 0.951547 0.947705 0.080846 0.848772 0.699036 0.705581
0 1 2 3 4 mean1 TMA
0 0.643560 0.412046 0.072525 0.618968 0.080146 0.365449 NaN
1 0.018226 0.222212 0.077592 0.125714 0.595707 0.207890 NaN
2 0.652139 0.907341 0.581802 0.021503 0.849562 0.602470 NaN
3 0.129509 0.315618 0.711265 0.812318 0.757575 0.545257 NaN
4 0.881567 0.455848 0.470282 0.367477 0.326812 0.500397 NaN
5 0.102455 0.156075 0.272582 0.719158 0.266293 0.303313 NaN
6 0.412049 0.527936 0.054381 0.587994 0.442144 0.404901 NaN
7 0.063904 0.635857 0.244050 0.002459 0.423960 0.274046 NaN
8 0.446264 0.116646 0.990394 0.678823 0.027085 0.451842 NaN
9 0.951547 0.947705 0.080846 0.848772 0.699036 0.705581 0.436115
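To make the period easy to change, here is a minimal sketch. It assumes, based on the TMA2 example in the question, that TMA_n is the mean of SMA_1 through SMA_n, where SMA_1 is the closing price itself. The helper name tma and the sample prices are hypothetical.

```python
import pandas as pd

def tma(close: pd.Series, period: int) -> pd.Series:
    """Average of the simple moving averages with windows 1..period."""
    # One rolling mean per window size; window=1 is just the price itself.
    smas = [close.rolling(window=i).mean() for i in range(1, period + 1)]
    # skipna=False keeps NaN until every SMA in the row is available.
    return pd.concat(smas, axis=1).mean(axis=1, skipna=False)

# Made-up closing prices for illustration.
close = pd.Series([1.0, 2.0, 3.0, 4.0])
result = tma(close, 2)  # same as (close + close.rolling(2).mean()) / 2
print(result)
```

With a real frame this would be used as EURUSD['TMA100'] = tma(EURUSD['Close'], 100), changing only the period argument.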
I've been working on this all morning and for the life of me cannot figure it out. I'm sure this is very basic, but I've become so frustrated my mind is being clouded. I'm attempting to calculate the total return of a portfolio of securities at each date (monthly).
The formula is (1 + r1) * (1 + r2) * ... * (1 + rt) - 1
Here is what I'm working with:
Adj_Returns = Adj_Close/Adj_Close.shift(1)-1
Adj_Returns['Risk Parity Portfolio'] = (Adj_Returns.loc['2003-01-31':]*Weights.shift(1)).sum(axis = 1)
Adj_Returns
SPY IYR LQD Risk Parity Portfolio
Date
2002-12-31 NaN NaN NaN 0.000000
2003-01-31 -0.019802 -0.014723 0.000774 -0.006840
2003-02-28 -0.013479 0.019342 0.015533 0.011701
2003-03-31 -0.001885 0.010015 0.001564 0.003556
2003-04-30 0.088985 0.045647 0.020696 0.036997
For example, with 2002-12-31 being base 100 for risk parity, I want 2003-01-31 to be 99.316 (100 * (1-0.006840)), 2003-02-28 to be 100.478 (99.316 * (1+ 0.011701)) so on and so forth.
Thanks!!
You want to use pd.DataFrame.cumprod
df.add(1).cumprod().sub(1).sum(1)
Consider the dataframe of returns df
np.random.seed([3,1415])
df = pd.DataFrame(np.random.normal(.025, .03, (10, 5)), columns=list('ABCDE'))
df
A B C D E
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521
1 0.024191 0.034487 0.035463 0.046461 0.048123
2 0.006754 0.035572 0.014424 0.012524 -0.002347
3 0.020724 0.047405 -0.020125 0.043341 0.037007
4 -0.003783 0.069827 0.014605 -0.019147 0.056897
5 0.056890 0.042756 0.033886 0.001758 0.049944
6 0.069609 0.032687 -0.001997 0.036253 0.009415
7 0.026503 0.053499 -0.006013 0.053447 0.047013
8 0.062084 0.029664 -0.015238 0.029886 0.062748
9 0.048341 0.065248 -0.024081 0.019139 0.028955
We can see the cumulative return or total return is
df.add(1).cumprod().sub(1)
A B C D E
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521
1 -0.015641 0.020983 0.000139 0.001702 0.063343
2 -0.008993 0.057301 0.014565 0.014247 0.060847
3 0.011544 0.107423 -0.005853 0.058206 0.100105
4 0.007717 0.184750 0.008666 0.037944 0.162699
5 0.065046 0.235405 0.042847 0.039769 0.220768
6 0.139183 0.275786 0.040764 0.077464 0.232261
7 0.169375 0.344039 0.034505 0.135051 0.290194
8 0.241974 0.383909 0.018742 0.168973 0.371151
9 0.302013 0.474207 -0.005791 0.191346 0.410852
Plot it
df.add(1).cumprod().sub(1).plot()
Add sum of returns to new column
df.assign(Portfolio=df.add(1).cumprod().sub(1).sum(1))
A B C D E Portfolio
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521 -0.114311
1 0.024191 0.034487 0.035463 0.046461 0.048123 0.070526
2 0.006754 0.035572 0.014424 0.012524 -0.002347 0.137967
3 0.020724 0.047405 -0.020125 0.043341 0.037007 0.271425
4 -0.003783 0.069827 0.014605 -0.019147 0.056897 0.401777
5 0.056890 0.042756 0.033886 0.001758 0.049944 0.603835
6 0.069609 0.032687 -0.001997 0.036253 0.009415 0.765459
7 0.026503 0.053499 -0.006013 0.053447 0.047013 0.973165
8 0.062084 0.029664 -0.015238 0.029886 0.062748 1.184749
9 0.048341 0.065248 -0.024081 0.019139 0.028955 1.372626
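For the base-100 series described in the question, a minimal sketch applying the same cumprod idea to a single column of returns. The values are the Risk Parity Portfolio returns shown in the question; the names returns, index_level and total_return are only illustrative.

```python
import pandas as pd

# Monthly portfolio returns from the question (first row is the base date).
returns = pd.Series(
    [0.0, -0.006840, 0.011701, 0.003556, 0.036997],
    index=pd.to_datetime(
        ['2002-12-31', '2003-01-31', '2003-02-28', '2003-03-31', '2003-04-30']
    ),
    name='Risk Parity Portfolio',
)

# (1 + r1) * (1 + r2) * ... * (1 + rt), scaled to start at 100.
index_level = 100 * returns.add(1).cumprod()

# The same product minus 1 is the total return up to each date.
total_return = returns.add(1).cumprod().sub(1)

print(index_level.round(3))
```

The second and third values come out at roughly 99.316 and 100.478, matching the figures worked out by hand in the question.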
I have a dataframe which consists of three columns, and I want to add a column containing "Yes" or "No" using python-pandas. The ratio between Yes:No should be 7:3.
Has anyone tried this?
With numpy's random.choice:
df["new_column"] = np.random.choice(["Yes", "No"], len(df), p=[0.7, 0.3])
Note: np.random.choice consists of independent trials (unless you pass replace=False). In each trial, the probability of getting a "Yes" will be 0.7. In the end you might not end up with exactly a 70% ratio. However, with 2480500 rows this binomial distribution is well approximated by a normal distribution with mean 2480500 * 0.7 and standard deviation sqrt(2480500 * 0.7 * 0.3). Within +/-3 standard deviations (99.73% probability) you will end up with a ratio in (0.69913, 0.70087). But if you want exactly 70%, you can use pandas' sample as @EdChum suggested; as far as I know, it draws without replacement, so it picks a fixed number of rows (frac times the number of rows, rounded).
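The interval quoted above can be reproduced with a few lines of arithmetic. This is just a sketch of the normal approximation described in the note; the variable names are illustrative.

```python
import math

n = 2480500  # number of rows
p = 0.7      # probability of "Yes" in each independent trial

mean = n * p                       # expected number of "Yes"
std = math.sqrt(n * p * (1 - p))   # binomial standard deviation

# Ratio of "Yes" at 3 standard deviations below and above the mean.
low = (mean - 3 * std) / n
high = (mean + 3 * std) / n

print(round(low, 5), round(high, 5))
```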
You can use sample to achieve this:
In [11]:
df = pd.DataFrame(np.random.randn(20,3), columns=list('abc'))
df
Out[11]:
a b c
0 -0.267704 1.030417 -0.494542
1 -0.830801 0.421847 1.296952
2 -1.165387 -0.381976 -0.178988
3 -0.800799 -0.240998 -0.900573
4 0.855965 0.765313 -0.125862
5 1.153730 1.323783 -0.113135
6 0.242592 -2.137141 -0.230177
7 -0.451582 0.267415 1.006564
8 0.071916 0.476523 1.326859
9 -1.168084 0.250367 -1.235262
10 0.238183 0.391661 -1.177926
11 -1.153294 -0.304811 -0.955384
12 -0.984470 -0.351073 -1.155049
13 -2.068388 1.294905 0.892136
14 -0.196381 -1.083988 0.203369
15 -1.430208 0.859933 1.152462
16 -0.250452 0.824815 0.425096
17 1.051399 -1.199689 0.487980
18 0.688910 -0.664028 -0.097302
19 -0.355774 0.064857 0.003731
In [12]:
df.loc[df.index.to_series().sample(frac=0.7).index, 'new_col'] = 'Yes'
df['new_col'] = df['new_col'].fillna('No')
df
Out[12]:
a b c new_col
0 -0.267704 1.030417 -0.494542 Yes
1 -0.830801 0.421847 1.296952 Yes
2 -1.165387 -0.381976 -0.178988 No
3 -0.800799 -0.240998 -0.900573 No
4 0.855965 0.765313 -0.125862 No
5 1.153730 1.323783 -0.113135 Yes
6 0.242592 -2.137141 -0.230177 Yes
7 -0.451582 0.267415 1.006564 Yes
8 0.071916 0.476523 1.326859 No
9 -1.168084 0.250367 -1.235262 Yes
10 0.238183 0.391661 -1.177926 Yes
11 -1.153294 -0.304811 -0.955384 Yes
12 -0.984470 -0.351073 -1.155049 Yes
13 -2.068388 1.294905 0.892136 Yes
14 -0.196381 -1.083988 0.203369 No
15 -1.430208 0.859933 1.152462 Yes
16 -0.250452 0.824815 0.425096 Yes
17 1.051399 -1.199689 0.487980 Yes
18 0.688910 -0.664028 -0.097302 Yes
19 -0.355774 0.064857 0.003731 No
Basically, you call sample with frac=0.7, use the resulting index to select rows of the df and assign the 'Yes' value, and then call fillna to assign the 'No' values to the remaining rows.
import pandas as pd
import random
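number_of_rows = len(df)  # this approach assumes a multiple of 10
# Build one block of 7 "Yes" and 3 "No", then repeat it to cover all rows
arr = ['Yes'] * 7 + ['No'] * 3
arr *= number_of_rows // 10
# Shuffle in place so the labels land on random rows
random.shuffle(arr)
df['column_name'] = arr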
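When the number of rows is not a multiple of 10, the same shuffle idea can be adapted. A minimal sketch, assuming the "Yes" count should be 70% of the rows rounded to the nearest integer; the 23-row frame, the seed and the name n_yes are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical frame whose length is not a multiple of 10.
df = pd.DataFrame({'a': range(23)})

# Fix the number of "Yes" labels up front so the ratio is as close to 7:3 as possible.
n_yes = round(0.7 * len(df))
labels = np.array(['Yes'] * n_yes + ['No'] * (len(df) - n_yes))

# Shuffle the labels; the seed is only there to make the example repeatable.
rng = np.random.default_rng(0)
df['new_column'] = rng.permutation(labels)

print(df['new_column'].value_counts())
```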
Quick and Dirty
pd.Series(np.random.rand(100)).apply(lambda x: 'Yes' if x < .7 else 'No')