Transitioning from R to Python: dplyr-like operations in pandas

I'm used to using R. If I had this in R I would do something like this:
library(dplyr)
df = df %>%
  mutate(
    XYZ = sum(x + y + z),
    weekcheck = ifelse(week > 3 & x*2 > 4, 'yes', week),  # multi-step if statement
    XYZ_plus_3 = XYZ + 3
  )
Here is my sample DataFrame in pandas:
df = pd.DataFrame({
'x': np.random.uniform(1., 168., 20),
'y': np.random.uniform(7., 334., 20),
'z': np.random.uniform(1.7, 20.7, 20),
'month': [5,6,7,8]*5,
'week': np.random.randint(1,4, 20)
})
I know there's .assign(), but I can't figure out the syntax for chaining these operations together, particularly the ifelse-style step.
Can anyone break this down for me? Even if you don't know R, I think the code is fairly self-explanatory.

You'd need two assign calls for that and the syntax is not as pretty:
(df.assign(XYZ=df[['x', 'y', 'z']].sum(axis=1),
           weekcheck=np.where((df['week'] > 3) & (df['x'] * 2 > 4), 'yes', df['week']))
   .assign(XYZ_plus_3=lambda d: d['XYZ'] + 3))
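For what it's worth, on pandas 0.23+ with Python 3.6+ (where keyword arguments keep their order), a single .assign call can express all three steps, provided the columns that depend on earlier ones are passed as callables. A sketch using the question's sample frame:

```python
import numpy as np
import pandas as pd

# Sample frame mirroring the question's data (values are random).
df = pd.DataFrame({
    'x': np.random.uniform(1., 168., 20),
    'y': np.random.uniform(7., 334., 20),
    'z': np.random.uniform(1.7, 20.7, 20),
    'month': [5, 6, 7, 8] * 5,
    'week': np.random.randint(1, 4, 20),
})

# In a single .assign, later keyword arguments may refer to columns
# created by earlier ones, as long as they are callables (lambdas).
out = df.assign(
    XYZ=lambda d: d[['x', 'y', 'z']].sum(axis=1),
    weekcheck=lambda d: np.where((d['week'] > 3) & (d['x'] * 2 > 4),
                                 'yes', d['week']),
    XYZ_plus_3=lambda d: d['XYZ'] + 3,
)
```

This keeps the mutate-like reading order without needing two separate .assign calls.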

Not sure if this is what you're looking for, but this is how I would do it in pandas. In particular, I think np.where() is a direct analog of R's ifelse (though I don't know R very well). There may be similar ways to do this in pandas, but I've always found np.where() to be the fastest and most general approach.
df['xyz'] = df.x + df.y + df.z
df['wcheck'] = np.where( (df.week>2) & (df.x*2>4), 'yes', df.week )
df['xyz_p3'] = df.xyz + 3
week x y z xyz wcheck xyz_p3
0 2 1.968759 31.537797 18.984273 52.490830 2 55.490830
1 1 108.809481 295.126414 14.250059 418.185954 1 421.185954
2 3 124.094087 201.229196 15.346794 340.670077 yes 343.670077
3 2 122.874717 110.675192 6.179610 239.729519 2 242.729519
4 1 74.909326 12.484076 4.921888 92.315290 1 95.315290
You could do some or all of this as a method chain, although I don't see a particular advantage here beyond making the code a little more compact and clean (not that I'm knocking that!). Much of the difference is just three lines vs. "one line" spread across three lines...
YMMV, but a lot of this comes down to the specific example, and in this case I would just do it in three separate lines of pandas rather than figuring out how to do it as a method chain with assign or pipe.

Here is how you'd do it with datar, a Python package that ports dplyr and other R packages to Python, following their API design:
In [1]: from datar.all import *
In [2]: df = tibble(
...: x=runif(20, 1., 168.),
...: y=runif(20, 7., 334.),
...: z=runif(20, 1.7, 20.7),
...: month=[5,6,7,8]*5,
...: week=rnorm(20, 1, 4)
...: )
In [3]: df
Out[3]:
x y z month week
0 122.186045 210.469468 3.685605 5 2.832896
1 165.584417 328.907586 8.535625 6 -0.277586
2 47.149510 205.991526 8.302771 7 -3.212263
3 88.110641 137.452398 11.920447 8 -3.307180
4 157.378195 215.928386 19.047386 5 0.442600
5 115.881867 122.972666 20.367191 6 -2.810770
6 70.939125 303.212096 2.864381 7 1.676704
7 124.173937 159.179588 16.231502 8 -1.431897
8 67.049824 266.658257 2.483528 5 -4.815040
9 165.531614 315.180892 13.855680 6 4.094581
10 59.077945 87.218260 10.638067 7 -0.204437
11 160.982998 320.093002 9.470513 8 -1.877375
12 23.520600 143.737008 1.989666 5 2.344435
13 26.028670 261.396529 19.844300 6 1.956208
14 100.008859 261.133030 15.947817 7 3.202203
15 102.298540 29.667462 4.470771 8 -4.747893
16 38.565169 239.578190 11.088213 5 0.268926
17 73.553130 49.714928 4.449677 6 -3.592172
18 74.467545 16.350189 8.195442 7 3.451417
19 162.439950 189.721896 7.729186 8 4.486240
In [4]: df >> rowwise() >> mutate(
...: XYZ=sum(f.x+f.y+f.z),
...: weekcheck=if_else((f.week > 3) & (f.x*2 > 4), 'yes', f.week),
...: XYZ_plus_3=f.XYZ+3
...: )
Out[4]:
x y z month week XYZ weekcheck XYZ_plus_3
0 122.186045 210.469468 3.685605 5 2.832896 336.341118 2.832896 339.341118
1 165.584417 328.907586 8.535625 6 -0.277586 503.027628 -0.277586 506.027628
2 47.149510 205.991526 8.302771 7 -3.212263 261.443807 -3.212263 264.443807
3 88.110641 137.452398 11.920447 8 -3.307180 237.483487 -3.30718 240.483487
4 157.378195 215.928386 19.047386 5 0.442600 392.353967 0.4426 395.353967
5 115.881867 122.972666 20.367191 6 -2.810770 259.221724 -2.81077 262.221724
6 70.939125 303.212096 2.864381 7 1.676704 377.015603 1.676704 380.015603
7 124.173937 159.179588 16.231502 8 -1.431897 299.585026 -1.431897 302.585026
8 67.049824 266.658257 2.483528 5 -4.815040 336.191610 -4.81504 339.191610
9 165.531614 315.180892 13.855680 6 4.094581 494.568187 yes 497.568187
10 59.077945 87.218260 10.638067 7 -0.204437 156.934272 -0.204437 159.934272
11 160.982998 320.093002 9.470513 8 -1.877375 490.546514 -1.877375 493.546514
12 23.520600 143.737008 1.989666 5 2.344435 169.247274 2.344435 172.247274
13 26.028670 261.396529 19.844300 6 1.956208 307.269499 1.956208 310.269499
14 100.008859 261.133030 15.947817 7 3.202203 377.089707 yes 380.089707
15 102.298540 29.667462 4.470771 8 -4.747893 136.436772 -4.747893 139.436772
16 38.565169 239.578190 11.088213 5 0.268926 289.231572 0.268926 292.231572
17 73.553130 49.714928 4.449677 6 -3.592172 127.717735 -3.592172 130.717735
18 74.467545 16.350189 8.195442 7 3.451417 99.013176 yes 102.013176
19 162.439950 189.721896 7.729186 8 4.486240 359.891031 yes 362.891031
[Rowwise: []]
I am the author of the package. Feel free to submit issues or ask me questions about using it.


Sorting a DataFrame with an Index in the form of a List

I have a pandas df that I need to sort based on a fixed order given in a list. The problem I'm having is that my attempted sort does not move the data rows into the order designated by the list. My list and DataFrame (df) look like this:
months = ['5','6','7','8','9','10','11','12','1','2']
df =
year 1992 1993 1994 1995
month
1 -0.343107 -0.211959 0.437974 -1.219363
2 -0.383353 0.888650 1.054926 0.714846
5 0.057198 1.246682 0.042684 0.275701
6 -0.100018 -0.801554 0.001111 0.382633
7 -0.283815 0.204448 0.350705 0.130652
8 0.042195 -0.433849 -1.481228 -0.236004
9 1.059776 0.875214 0.304638 0.127819
10 -0.328911 -0.256656 1.081157 1.057449
11 -0.488213 -0.957050 -0.813885 1.403822
12 0.973031 -0.246714 0.600157 0.579038
The closest that I have gotten is this -
newdf = pd.DataFrame(df.values, index=list(months))
...but it does not move the rows. This command only adds the months in the index column without moving the data.
0 1 2 3
5 -0.343107 -0.211959 0.437974 -1.219363
6 -0.383353 0.888650 1.054926 0.714846
7 0.057198 1.246682 0.042684 0.275701
8 -0.100018 -0.801554 0.001111 0.382633
9 -0.283815 0.204448 0.350705 0.130652
10 0.042195 -0.433849 -1.481228 -0.236004
11 1.059776 0.875214 0.304638 0.127819
12 -0.328911 -0.256656 1.081157 1.057449
1 -0.488213 -0.957050 -0.813885 1.403822
2 0.973031 -0.246714 0.600157 0.579038
I need the result to look like -
year 1992 1993 1994 1995
month
5 0.057198 1.246682 0.042684 0.275701
6 -0.100018 -0.801554 0.001111 0.382633
7 -0.283815 0.204448 0.350705 0.130652
8 0.042195 -0.433849 -1.481228 -0.236004
9 1.059776 0.875214 0.304638 0.127819
10 -0.328911 -0.256656 1.081157 1.057449
11 -0.488213 -0.957050 -0.813885 1.403822
12 0.973031 -0.246714 0.600157 0.579038
1 -0.343107 -0.211959 0.437974 -1.219363
2 -0.383353 0.888650 1.054926 0.714846
Assuming df.index is dtype('int64'), first convert months to integers. Then use loc:
months = [*map(int, months)]
out = df.loc[months]
If df.index is dtype('O'), you can use loc right away, i.e. you don't need the first line.
Output:
year 1992 1993 1994 1995
month
5 0.057198 1.246682 0.042684 0.275701
6 -0.100018 -0.801554 0.001111 0.382633
7 -0.283815 0.204448 0.350705 0.130652
8 0.042195 -0.433849 -1.481228 -0.236004
9 1.059776 0.875214 0.304638 0.127819
10 -0.328911 -0.256656 1.081157 1.057449
11 -0.488213 -0.957050 -0.813885 1.403822
12 0.973031 -0.246714 0.600157 0.579038
1 -0.343107 -0.211959 0.437974 -1.219363
2 -0.383353 0.888650 1.054926 0.714846
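As a self-contained illustration of the loc approach (toy values, integer index like the question's):

```python
import numpy as np
import pandas as pd

# Toy frame with an integer 'month' index, like the question's df.
df = pd.DataFrame(np.arange(8).reshape(4, 2),
                  index=pd.Index([1, 2, 5, 6], name='month'),
                  columns=[1992, 1993])

months = ['5', '6', '1', '2']        # desired row order, as strings
out = df.loc[[*map(int, months)]]    # cast to int to match the index dtype

print(out.index.tolist())  # [5, 6, 1, 2]
```

loc with a list reorders (and selects) rows by label, which is exactly what reassigning the index values could not do.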

Is there a way to rank some items in a pandas dataframe and exclude others?

I have a pandas DataFrame called ranks with my clusters and their key metrics. I rank them using rank(); however, there are two specific clusters which I want ranked differently from the others.
ranks = pd.DataFrame(data={'Cluster': ['0', '1', '2',
'3', '4', '5','6', '7', '8', '9'],
'No. Customers': [145118,
2,
1236,
219847,
9837,
64865,
3855,
219549,
34171,
3924120],
'Ave. Recency': [39.0197,
47.0,
15.9716,
41.9736,
23.9330,
24.8281,
26.5647,
17.7493,
23.5205,
24.7933],
'Ave. Frequency': [1.7264,
19.0,
24.9101,
3.0682,
3.2735,
1.8599,
3.9304,
3.3356,
9.1703,
1.1684],
'Ave. Monetary': [14971.85,
237270.00,
126992.79,
17701.64,
172642.35,
13159.21,
54333.56,
17570.67,
42136.68,
4754.76]})
ranks['Ave. Spend'] = ranks['Ave. Monetary']/ranks['Ave. Frequency']
Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
0 0 145118 39.0197 1.7264 14,971.85 8,672.07
1 1 2 47.0 19.0 237,270.00 12,487.89
2 2 1236 15.9716 24.9101 126,992.79 5,098.02
3 3 219847 41.9736 3.0682 17,701.64 5,769.23
4 4 9837 23.9330 3.2735 172,642.35 52,738.42
5 5 64865 24.8281 1.8599 13,159.21 7,075.19
6 6 3855 26.5647 3.9304 54,333.56 13,823.64
7 7 219549 17.7493 3.3356 17,570.67 5,267.52
8 8 34171 23.5205 9.1703 42,136.68 4,594.89
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21
I then apply the rank() method like this:
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
ranks['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
ranks['overall_rank'] = ranks['overall'].rank(method='first')
Which gives me this:
Cluster No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank
0 0 145118 39.0197 1.7264 14,971.85 8,672.07 8 9 8 4 29 9
1 1 2 47.0 19.0 237,270.00 12,487.89 10 2 1 3 16 3
2 2 1236 15.9716 24.9101 126,992.79 5,098.02 1 1 3 8 13 1
3 3 219847 41.9736 3.0682 17,701.64 5,769.23 9 7 6 6 28 7
4 4 9837 23.9330 3.2735 172,642.35 52,738.42 4 6 2 1 13 2
5 5 64865 24.8281 1.8599 13,159.21 7,075.19 6 8 9 5 28 8
6 6 3855 26.5647 3.9304 54,333.56 13,823.64 7 4 4 2 17 4
7 7 219549 17.7493 3.3356 17,570.67 5,267.52 2 5 7 7 21 6
8 8 34171 23.5205 9.1703 42,136.68 4,594.89 3 3 5 9 20 5
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21 5 10 10 10 35 10
This does what it's supposed to do; however, the cluster with the highest Ave. Spend needs to be ranked 1 at all times, and the cluster with the highest Ave. Recency needs to be ranked last at all times.
So I modified the code above to look like this:
if(ranks['s_rank'].min() == 1):
ranks['overall_rank_2'] = 1
elif(ranks['r_rank'].max() == len(ranks)):
ranks['overall_rank_2'] = len(ranks)
else:
ranks_2 = ranks.drop(ranks.index[[ranks[ranks['s_rank'] == ranks['s_rank'].min()].index[0],ranks[ranks['r_rank'] == ranks['r_rank'].max()].index[0]]])
ranks_2['r_rank'] = ranks_2['Ave. Recency'].rank()
ranks_2['f_rank'] = ranks_2['Ave. Frequency'].rank(ascending=False)
ranks_2['m_rank'] = ranks_2['Ave. Monetary'].rank(ascending=False)
ranks_2['s_rank'] = ranks_2['Ave. Spend'].rank(ascending=False)
ranks_2['overall'] = ranks.apply(lambda row: row.r_rank + row.f_rank + row.m_rank + row.s_rank, axis=1)
ranks['overall_rank_2'] = ranks_2['overall'].rank(method='first')
Then I get this
Cluster No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend|r_rank|f_rank|m_rank|s_rank|overall|overall_rank|overall_rank_2
0 0 145118 39.0197 1.7264 14,971.85 8,672.07 8 9 8 4 29 9 1
1 1 2 47.0 19.0 237,270.00 12,487.89 10 2 1 3 16 3 1
2 2 1236 15.9716 24.9101 126,992.79 5,098.02 1 1 3 8 13 1 1
3 3 219847 41.9736 3.0682 17,701.64 5,769.23 9 7 6 6 28 7 1
4 4 9837 23.9330 3.2735 172,642.35 52,738.42 4 6 2 1 13 2 1
5 5 64865 24.8281 1.8599 13,159.21 7,075.19 6 8 9 5 28 8 1
6 6 3855 26.5647 3.9304 54,333.56 13,823.64 7 4 4 2 17 4 1
7 7 219549 17.7493 3.3356 17,570.67 5,267.52 2 5 7 7 21 6 1
8 8 34171 23.5205 9.1703 42,136.68 4,594.89 3 3 5 9 20 5 1
9 9 3924120 24.7933 1.1684 4,754.76 4,069.21 5 10 10 10 35 10 1
Please help me modify the above if statement, or perhaps recommend a different approach altogether. This of course needs to be as dynamic as possible.
So you want a custom ranking on your dataframe, where the cluster(/row) with the highest Ave. Spend is always ranked 1, and the one with the highest Ave. Recency always ranks last.
The solution is five lines. Notes:
You had the right idea with DataFrame.drop(), just use idxmax() to get the index of both of the rows that will need special treatment, and store it, so you don't need a huge unwieldy logical filter expression in your drop.
No need to make so many temporary columns, or the temporary copy ranks_2 = ranks.drop(...); just pass the result of the drop() into a rank() ...
... via a .sum(axis=1) on your desired columns, no need to define a lambda, or save its output in the temp column 'overall'.
...then we just feed those sum-of-ranks into rank(), which will give us values from 1..8, so we add 1 to offset the results of rank() to be 2..9. (You can generalize this).
And we manually set the 'overall_rank' for the Ave. Spend, Ave. Recency rows.
(Yes you could also implement all this as a custom function whose input is the four Ave. columns or else the four *_rank columns.)
Code: (see at bottom for boilerplate to read in your dataframe, next time please make your example MCVE, to help us help you)
# Compute raw ranks like you do
ranks['r_rank'] = ranks['Ave. Recency'].rank()
ranks['f_rank'] = ranks['Ave. Frequency'].rank(ascending=False)
ranks['m_rank'] = ranks['Ave. Monetary'].rank(ascending=False)
ranks['s_rank'] = ranks['Ave. Spend'].rank(ascending=False)
# Find the indices of both the highest AveSpend and AveRecency
ismax = ranks['Ave. Spend'].idxmax()
irmax = ranks['Ave. Recency'].idxmax()
# Get the overall ranking for every row other than these... add 1 to offset for excluding the max-AveSpend row:
ranks['overall_rank'] = 1 + ranks.drop(index=[ismax, irmax])[['r_rank', 'f_rank', 'm_rank', 's_rank']].sum(axis=1).rank(method='first')
# (Note: in .loc[], can't mix indices (ismax) with column-names)
ranks.loc[ ranks['Ave. Spend'].idxmax(), 'overall_rank' ] = 1
ranks.loc[ ranks['Ave. Recency'].idxmax(), 'overall_rank' ] = len(ranks)
And here's the boilerplate to ingest your data:
import pandas as pd
from io import StringIO
# """Cluster No. Customers| Ave. Recency| Ave. Frequency| Ave. Monetary| Ave. Spend|
dat = """
0 145118 39.0197 1.7264 14,971.85 8,672.07
1 2 47.0 19.0 237,270.00 12,487.89
2 1236 15.9716 24.9101 126,992.79 5,098.02
3 219847 41.9736 3.0682 17,701.64 5,769.23
4 9837 23.9330 3.2735 172,642.35 52,738.42
5 64865 24.8281 1.8599 13,159.21 7,075.19
6 3855 26.5647 3.9304 54,333.56 13,823.64
7 219549 17.7493 3.3356 17,570.67 5,267.52
8 34171 23.5205 9.1703 42,136.68 4,594.89
9 3924120 24.7933 1.1684 4,754.76 4,069.21 """
# Remove the comma thousands-separator, to prevent your floats being read in as string
dat = dat.replace(',', '')
ranks = pd.read_csv(StringIO(dat), sep=r'\s+', names=
    "Cluster|No. Customers|Ave. Recency|Ave. Frequency|Ave. Monetary|Ave. Spend".split('|'))

How to create a Triangular moving average in python using a for loop

I use python pandas to calculate the following formula
(https://i.stack.imgur.com/XIKBz.png)
I do it in python like this:
EURUSD['SMA2'] = EURUSD['Close'].rolling(2).mean()
EURUSD['TMA2'] = (EURUSD['Close'] + EURUSD['SMA2']) / 2
The problem is that the code gets long when I calculate TMA 100, so I need to use a for loop to easily change the TMA period.
Thanks in advance.
Edit: I found this code, but it raises an error:
values = []
for i in range(1, 201):
    values.append(eurusd['Close']).rolling(window=i).mean()
values.mean()
TMA is an average of averages.
import numpy as np
import pandas as pd
df = pd.DataFrame(np.random.rand(10, 5))
print(df)
# df['mean0']=df.mean(0)
df['mean1']=df.mean(1)
print(df)
df['TMA'] = df['mean1'].rolling(window=10,center=False).mean()
print(df)
Or you can easily print it.
print(df["mean1"].mean())
Here is how it looks:
0 1 2 3 4
0 0.643560 0.412046 0.072525 0.618968 0.080146
1 0.018226 0.222212 0.077592 0.125714 0.595707
2 0.652139 0.907341 0.581802 0.021503 0.849562
3 0.129509 0.315618 0.711265 0.812318 0.757575
4 0.881567 0.455848 0.470282 0.367477 0.326812
5 0.102455 0.156075 0.272582 0.719158 0.266293
6 0.412049 0.527936 0.054381 0.587994 0.442144
7 0.063904 0.635857 0.244050 0.002459 0.423960
8 0.446264 0.116646 0.990394 0.678823 0.027085
9 0.951547 0.947705 0.080846 0.848772 0.699036
0 1 2 3 4 mean1
0 0.643560 0.412046 0.072525 0.618968 0.080146 0.365449
1 0.018226 0.222212 0.077592 0.125714 0.595707 0.207890
2 0.652139 0.907341 0.581802 0.021503 0.849562 0.602470
3 0.129509 0.315618 0.711265 0.812318 0.757575 0.545257
4 0.881567 0.455848 0.470282 0.367477 0.326812 0.500397
5 0.102455 0.156075 0.272582 0.719158 0.266293 0.303313
6 0.412049 0.527936 0.054381 0.587994 0.442144 0.404901
7 0.063904 0.635857 0.244050 0.002459 0.423960 0.274046
8 0.446264 0.116646 0.990394 0.678823 0.027085 0.451842
9 0.951547 0.947705 0.080846 0.848772 0.699036 0.705581
0 1 2 3 4 mean1 TMA
0 0.643560 0.412046 0.072525 0.618968 0.080146 0.365449 NaN
1 0.018226 0.222212 0.077592 0.125714 0.595707 0.207890 NaN
2 0.652139 0.907341 0.581802 0.021503 0.849562 0.602470 NaN
3 0.129509 0.315618 0.711265 0.812318 0.757575 0.545257 NaN
4 0.881567 0.455848 0.470282 0.367477 0.326812 0.500397 NaN
5 0.102455 0.156075 0.272582 0.719158 0.266293 0.303313 NaN
6 0.412049 0.527936 0.054381 0.587994 0.442144 0.404901 NaN
7 0.063904 0.635857 0.244050 0.002459 0.423960 0.274046 NaN
8 0.446264 0.116646 0.990394 0.678823 0.027085 0.451842 NaN
9 0.951547 0.947705 0.080846 0.848772 0.699036 0.705581 0.436115
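Rather than a for loop, the asker's two lines generalize into a function that takes the period as a parameter (tma is my name for it; the body simply parameterizes the question's (Close + SMA_n) / 2 formula; note the classic triangular MA is instead an SMA applied twice, i.e. close.rolling(n).mean().rolling(n).mean()):

```python
import numpy as np
import pandas as pd

def tma(close, n):
    """Parameterized version of the question's two lines:
    an n-period SMA averaged with the close."""
    sma = close.rolling(n).mean()
    return (close + sma) / 2

close = pd.Series(np.arange(1., 11.))  # toy 'Close' prices 1..10
result = tma(close, 2)
print(result.iloc[1])  # 1.75, i.e. (2 + 1.5) / 2
```

Changing the period is then just tma(EURUSD['Close'], 100) instead of editing two hard-coded lines.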

Multiplying data within columns python

I've been working on this all morning and for the life of me cannot figure it out. I'm sure this is very basic, but I've become so frustrated my mind is being clouded. I'm attempting to calculate the total return of a portfolio of securities at each date (monthly).
The formula is (1 + r1) * (1 + r2) * ... * (1 + rt) - 1
Here is what I'm working with:
Adj_Returns = Adj_Close/Adj_Close.shift(1)-1
Adj_Returns['Risk Parity Portfolio'] = (Adj_Returns.loc['2003-01-31':]*Weights.shift(1)).sum(axis = 1)
Adj_Returns
SPY IYR LQD Risk Parity Portfolio
Date
2002-12-31 NaN NaN NaN 0.000000
2003-01-31 -0.019802 -0.014723 0.000774 -0.006840
2003-02-28 -0.013479 0.019342 0.015533 0.011701
2003-03-31 -0.001885 0.010015 0.001564 0.003556
2003-04-30 0.088985 0.045647 0.020696 0.036997
For example, with 2002-12-31 being base 100 for risk parity, I want 2003-01-31 to be 99.316 (100 * (1-0.006840)), 2003-02-28 to be 100.478 (99.316 * (1+ 0.011701)) so on and so forth.
Thanks!!
You want to use pd.DataFrame.cumprod
df.add(1).cumprod().sub(1).sum(1)
Consider the dataframe of returns df
np.random.seed([3,1415])
df = pd.DataFrame(np.random.normal(.025, .03, (10, 5)), columns=list('ABCDE'))
df
A B C D E
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521
1 0.024191 0.034487 0.035463 0.046461 0.048123
2 0.006754 0.035572 0.014424 0.012524 -0.002347
3 0.020724 0.047405 -0.020125 0.043341 0.037007
4 -0.003783 0.069827 0.014605 -0.019147 0.056897
5 0.056890 0.042756 0.033886 0.001758 0.049944
6 0.069609 0.032687 -0.001997 0.036253 0.009415
7 0.026503 0.053499 -0.006013 0.053447 0.047013
8 0.062084 0.029664 -0.015238 0.029886 0.062748
9 0.048341 0.065248 -0.024081 0.019139 0.028955
We can see the cumulative return or total return is
df.add(1).cumprod().sub(1)
A B C D E
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521
1 -0.015641 0.020983 0.000139 0.001702 0.063343
2 -0.008993 0.057301 0.014565 0.014247 0.060847
3 0.011544 0.107423 -0.005853 0.058206 0.100105
4 0.007717 0.184750 0.008666 0.037944 0.162699
5 0.065046 0.235405 0.042847 0.039769 0.220768
6 0.139183 0.275786 0.040764 0.077464 0.232261
7 0.169375 0.344039 0.034505 0.135051 0.290194
8 0.241974 0.383909 0.018742 0.168973 0.371151
9 0.302013 0.474207 -0.005791 0.191346 0.410852
Plot it
df.add(1).cumprod().sub(1).plot()
Add sum of returns to new column
df.assign(Portfolio=df.add(1).cumprod().sub(1).sum(1))
A B C D E Portfolio
0 -0.038892 -0.013054 -0.034115 -0.042772 0.014521 -0.114311
1 0.024191 0.034487 0.035463 0.046461 0.048123 0.070526
2 0.006754 0.035572 0.014424 0.012524 -0.002347 0.137967
3 0.020724 0.047405 -0.020125 0.043341 0.037007 0.271425
4 -0.003783 0.069827 0.014605 -0.019147 0.056897 0.401777
5 0.056890 0.042756 0.033886 0.001758 0.049944 0.603835
6 0.069609 0.032687 -0.001997 0.036253 0.009415 0.765459
7 0.026503 0.053499 -0.006013 0.053447 0.047013 0.973165
8 0.062084 0.029664 -0.015238 0.029886 0.062748 1.184749
9 0.048341 0.065248 -0.024081 0.019139 0.028955 1.372626
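To get the base-100 index described in the question, the same cumprod idea applies; just drop the final .sub(1) and scale by 100. Using the 'Risk Parity Portfolio' returns from the question's table:

```python
import pandas as pd

# Monthly portfolio returns, copied from the question's table.
r = pd.Series([0.000000, -0.006840, 0.011701, 0.003556, 0.036997],
              index=['2002-12-31', '2003-01-31', '2003-02-28',
                     '2003-03-31', '2003-04-30'])

# Compound the monthly returns and rescale to a base of 100.
index_100 = 100 * r.add(1).cumprod()
print(index_100.round(3).tolist())  # [100.0, 99.316, 100.478, 100.835, 104.566]
```

The second and third values match the 99.316 and 100.478 the question works out by hand.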

How to randomly append "Yes/No" (ratio of 7:3) to a column in pandas dataframe?

I have a DataFrame which consists of three columns, and I want to append "Yes" or "No" to one of the columns using python-pandas, with a Yes:No ratio of 7:3.
Has anyone tried this?
With numpy's random.choice:
df["new_column"] = np.random.choice(["Yes", "No"], len(df), p=[0.7, 0.3])
Note: np.random.choice performs independent trials (unless you pass replace=False). In each trial, the probability of getting a "Yes" is 0.7, so you might not end up with exactly a 70% ratio. However, with 2480500 rows this binomial distribution is well approximated by a normal distribution with mean 2480500 * 0.7 and standard deviation sqrt(2480500 * 0.7 * 0.3). Within +/-3 standard deviations (99.73% probability) you will end up with a ratio between (0.69913, 0.70087). But if you want exactly 70%, you can use pandas' sample as @EdChum suggested, since it draws a fixed number of rows.
You can use sample to achieve this:
In [11]:
df = pd.DataFrame(np.random.randn(20,3), columns=list('abc'))
df
Out[11]:
a b c
0 -0.267704 1.030417 -0.494542
1 -0.830801 0.421847 1.296952
2 -1.165387 -0.381976 -0.178988
3 -0.800799 -0.240998 -0.900573
4 0.855965 0.765313 -0.125862
5 1.153730 1.323783 -0.113135
6 0.242592 -2.137141 -0.230177
7 -0.451582 0.267415 1.006564
8 0.071916 0.476523 1.326859
9 -1.168084 0.250367 -1.235262
10 0.238183 0.391661 -1.177926
11 -1.153294 -0.304811 -0.955384
12 -0.984470 -0.351073 -1.155049
13 -2.068388 1.294905 0.892136
14 -0.196381 -1.083988 0.203369
15 -1.430208 0.859933 1.152462
16 -0.250452 0.824815 0.425096
17 1.051399 -1.199689 0.487980
18 0.688910 -0.664028 -0.097302
19 -0.355774 0.064857 0.003731
In [12]:
df.loc[df.index.to_series().sample(frac=0.7).index, 'new_col'] = 'Yes'
df['new_col'].fillna('No',inplace=True)
df
Out[12]:
a b c new_col
0 -0.267704 1.030417 -0.494542 Yes
1 -0.830801 0.421847 1.296952 Yes
2 -1.165387 -0.381976 -0.178988 No
3 -0.800799 -0.240998 -0.900573 No
4 0.855965 0.765313 -0.125862 No
5 1.153730 1.323783 -0.113135 Yes
6 0.242592 -2.137141 -0.230177 Yes
7 -0.451582 0.267415 1.006564 Yes
8 0.071916 0.476523 1.326859 No
9 -1.168084 0.250367 -1.235262 Yes
10 0.238183 0.391661 -1.177926 Yes
11 -1.153294 -0.304811 -0.955384 Yes
12 -0.984470 -0.351073 -1.155049 Yes
13 -2.068388 1.294905 0.892136 Yes
14 -0.196381 -1.083988 0.203369 No
15 -1.430208 0.859933 1.152462 Yes
16 -0.250452 0.824815 0.425096 Yes
17 1.051399 -1.199689 0.487980 Yes
18 0.688910 -0.664028 -0.097302 Yes
19 -0.355774 0.064857 0.003731 No
Basically you can call sample and pass param frac=0.7 and then use the index to mask the df and assign the 'yes' value and then call fillna to assign the 'no' values
import random

arr = ['Yes'] * 7 + ['No'] * 3
arr *= len(df) // 10  # assumes the number of rows is a multiple of 10
random.shuffle(arr)
df['column_name'] = arr
Quick and Dirty
pd.Series(np.random.rand(100)).apply(lambda x: 'Yes' if x < .7 else 'No')
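If an exact 7:3 split is required for any row count (rather than 70% in expectation), another option is to build the label array deterministically and shuffle it. A sketch on a toy frame; this assumes rounding len(df) * 0.7 gives the "Yes" count you want:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': range(10)})  # toy frame

n = len(df)
n_yes = int(round(n * 0.7))
labels = np.array(['Yes'] * n_yes + ['No'] * (n - n_yes))

# Shuffle so the Yes/No positions are random but the counts are exact.
df['new_column'] = np.random.permutation(labels)
```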
