Rolling Mean not being calculated on a new column - python

I have an issue calculating the rolling mean for a column I added in the code. For some reason, it doesn't work on the column I added but works on a column from the original csv.
Original dataframe from the csv as follows:
Open High Low Last Change Volume Open Int
Time
09/20/19 98.50 99.00 98.35 98.95 0.60 3305.0 0.0
09/19/19 100.35 100.75 98.10 98.35 -2.00 17599.0 0.0
09/18/19 100.65 101.90 100.10 100.35 0.00 18258.0 121267.0
09/17/19 103.75 104.00 100.00 100.35 -3.95 34025.0 122453.0
09/16/19 102.30 104.95 101.60 104.30 1.55 21403.0 127447.0
Ticker = pd.read_csv('\\......\Historical data\kcz19 daily.csv',
index_col=0, parse_dates=True)
Ticker['Return'] = np.log(Ticker['Last'] / Ticker['Last'].shift(1)).fillna('')
Ticker['ret20'] = Ticker['Return'].rolling(window=20, win_type='triang').mean()
print(Ticker.head())
Open High Low ... Open Int Return ret20
Time ...
09/20/19 98.50 99.00 98.35 ... 0.0
09/19/19 100.35 100.75 98.10 ... 0.0 -0.00608213 -0.00608213
09/18/19 100.65 101.90 100.10 ... 121267.0 0.0201315 0.0201315
09/17/19 103.75 104.00 100.00 ... 122453.0 0 0
09/16/19 102.30 104.95 101.60 ... 127447.0 0.0386073 0.0386073
The ret20 column should hold the rolling mean of the Return column, so it should only show data starting from row 21, whereas here it is just a copy of the Return column.
If I compute it on the Last column instead, it works.
Below is the result using column Last:
Open High Low ... Open Int Return ret20
Time ...
09/20/19 98.50 99.00 98.35 ... 0.0 NaN
09/19/19 100.35 100.75 98.10 ... 0.0 -0.00608213 NaN
09/18/19 100.65 101.90 100.10 ... 121267.0 0.0201315 NaN
09/17/19 103.75 104.00 100.00 ... 122453.0 0 NaN
09/16/19 102.30 104.95 101.60 ... 127447.0 0.0386073 NaN
09/13/19 103.25 103.60 102.05 ... 128707.0 -0.0149725 NaN
09/12/19 102.80 103.85 101.15 ... 128904.0 0.00823848 NaN
09/11/19 102.00 104.70 101.40 ... 132067.0 -0.00193237 NaN
09/10/19 98.50 102.25 98.00 ... 135349.0 -0.0175614 NaN
09/09/19 97.00 99.25 95.30 ... 137347.0 -0.0335283 NaN
09/06/19 95.35 97.30 95.00 ... 135399.0 -0.0122889 NaN
09/05/19 96.80 97.45 95.05 ... 136142.0 -0.0171477 NaN
09/04/19 95.65 96.95 95.50 ... 134864.0 0.0125002 NaN
09/03/19 96.00 96.60 94.20 ... 134685.0 -0.0109291 NaN
08/30/19 95.40 97.20 95.10 ... 134061.0 0.0135137 NaN
08/29/19 97.05 97.50 94.75 ... 132639.0 -0.0166584 NaN
08/28/19 97.40 98.15 95.95 ... 130573.0 0.0238601 NaN
08/27/19 97.35 98.00 96.40 ... 129921.0 -0.00410889 NaN
08/26/19 95.55 98.50 95.25 ... 129003.0 0.0035962 NaN
08/23/19 96.90 97.40 95.05 ... 130268.0 -0.0149835 98.97775
Appreciate any help

The .fillna('') is creating a string in the first row, which then creates errors for the rolling calculation in Ticker['ret20'].
Delete it and the code will run fine:
Ticker['Return'] = np.log(Ticker['Last'] / Ticker['Last'].shift(1))
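To see why, note that .fillna('') silently turns the float column into object dtype, and rolling aggregations need numeric data. A minimal sketch of the dtype change (the prices are made up for illustration):
import numpy as np
import pandas as pd

last = pd.Series([98.95, 98.35, 100.35])  # hypothetical prices
ret = np.log(last / last.shift(1))
print(ret.dtype)             # float64 -> .rolling(...).mean() works
print(ret.fillna('').dtype)  # object  -> the rolling mean breaks
If you really want to replace the leading NaN, fill with a number such as .fillna(0) rather than a string.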

Related

Using DataFrame Columns as id

Does anyone know how to transform this DataFrame so that the column names become a query ID (keeping the df length) and the values are flattened? I am trying to learn about 'learning to rank' algorithms. Thanks for the help.
AUD=X CAD=X CHF=X ... SGD=X THB=X ZAR=X
Date ...
2004-06-30 NaN 1.33330 1.25040 ... 1.72090 40.834999 6.12260
2004-07-01 NaN 1.33160 1.24900 ... 1.71420 40.716999 6.16500
2004-07-02 NaN 1.32270 1.23320 ... 1.71160 40.638000 6.12010
2004-07-05 NaN 1.32470 1.23490 ... 1.71480 40.658001 6.15010
2004-07-06 NaN 1.32660 1.23660 ... 1.71530 40.765999 6.20990
... ... ... ... ... ... ...
2021-07-19 1.352997 1.26169 0.91853 ... 1.35630 32.810001 14.38950
2021-07-20 1.362546 1.27460 0.91850 ... 1.36360 32.840000 14.53068
2021-07-21 1.362600 1.26751 0.92123 ... 1.36621 32.820000 14.59157
2021-07-22 1.360060 1.25689 0.91757 ... 1.36383 32.849998 14.57449
2021-07-23 1.354922 1.25640 0.91912 ... 1.35935 32.879002 14.69760
In [3]: df
Out[3]:
AUD=X CAD=X CHF=X SGD=X THB=X ZAR=X
Date
2004-06-30 NaN 1.3333 1.2504 1.7209 40.834999 6.1226
2004-07-01 NaN 1.3316 1.2490 1.7142 40.716999 6.1650
2004-07-02 NaN 1.3227 1.2332 1.7116 40.638000 6.1201
2004-07-05 NaN 1.3247 1.2349 1.7148 40.658001 6.1501
2004-07-06 NaN 1.3266 1.2366 1.7153 40.765999 6.2099
In [6]: df.columns = df.columns.str.slice(0, -2)
In [8]: df.T
Out[8]:
Date 2004-06-30 2004-07-01 2004-07-02 2004-07-05 2004-07-06
AUD NaN NaN NaN NaN NaN
CAD 1.333300 1.331600 1.3227 1.324700 1.326600
CHF 1.250400 1.249000 1.2332 1.234900 1.236600
SGD 1.720900 1.714200 1.7116 1.714800 1.715300
THB 40.834999 40.716999 40.6380 40.658001 40.765999
ZAR 6.122600 6.165000 6.1201 6.150100 6.209900
I'm still not super clear on the requirements, but this transformation might help.
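If the goal is a long frame with one row per (Date, ticker) observation, a stack-based sketch like the one below may be closer to a learning-to-rank input; the 'qid' and 'rate' names are my own placeholders.
# hedged sketch: flatten the wide frame so the former column names
# become a per-row query id
long_df = (
    df.stack()
      .rename_axis(['Date', 'qid'])
      .reset_index(name='rate')
)
print(long_df.head())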

List of dataframes with different column names to a single pandas dataframe

I have a list of 3 dataframes of stock tickers and prices that I want to convert into a single dataframe.
dataframes:
[ Date AMBU-B.CO BAVA.CO CARL-B.CO CHR.CO COLO-B.CO \
0 2020-01-02 112.500000 172.850006 984.400024 525.599976 814.000000
1 2020-01-03 111.300003 171.199997 989.799988 526.799988 812.000000
2 2020-01-06 108.150002 166.100006 1001.000000 519.599976 820.200012
3 2020-01-07 110.500000 170.000000 1002.000000 522.400024 823.599976
4 2020-01-08 109.599998 171.399994 993.000000 510.399994 820.000000
.. ... ... ... ... ... ...
308 2021-03-25 270.000000 295.200012 965.799988 562.599976 964.200012
309 2021-03-26 271.299988 302.000000 974.599976 548.599976 954.000000
310 2021-03-29 281.000000 294.000000 981.400024 575.000000 968.200012
311 2021-03-30 280.899994 282.600006 986.599976 567.400024 950.200012
312 2021-03-31 297.899994 286.399994 974.599976 576.400024 953.799988
DANSKE.CO DEMANT.CO DSV.CO FLS.CO ... NETC.CO \
0 110.349998 208.600006 769.799988 272.500000 ... 314.000000
1 107.900002 206.600006 751.400024 267.899994 ... 313.000000
2 106.699997 206.500000 752.400024 265.600006 ... 309.799988
3 107.750000 204.399994 753.799988 273.399994 ... 309.200012
4 108.250000 205.600006 755.799988 268.000000 ... 309.200012
.. ... ... ... ... ... ...
308 117.349998 260.399994 1170.000000 230.199997 ... 603.000000
309 120.050003 267.600006 1212.500000 237.800003 ... 603.500000
310 118.750000 267.100006 1206.000000 238.300003 ... 599.000000
311 120.500000 265.500000 1213.500000 243.600006 ... 592.000000
312 118.699997 268.700012 1244.500000 243.100006 ... 604.000000
NOVO-B.CO NZYM-B.CO ORSTED.CO PNDORA.CO RBREW.CO ROCK-B.CO \
0 388.700012 327.100006 681.000000 293.000000 603.000000 1584.0
1 383.200012 322.500000 677.400024 293.200012 605.200012 1567.0
2 382.049988 321.200012 670.200012 328.200012 601.599976 1547.0
3 381.700012 322.000000 662.000000 339.299988 612.200012 1546.0
4 382.500000 322.700012 645.000000 343.600006 602.200012 1531.0
.. ... ... ... ... ... ...
308 425.450012 403.399994 983.000000 655.799988 658.400024 2506.0
309 423.549988 404.100006 1013.500000 666.400024 666.599976 2672.0
310 431.549988 404.000000 1013.000000 678.400024 669.799988 2650.0
311 430.700012 401.500000 998.799988 678.400024 672.000000 2632.0
312 429.750000 406.299988 1024.500000 679.599976 663.400024 2674.0
SIM.CO TRYG.CO VWS.CO
0 776.0 196.399994 659.400024
1 764.5 195.600006 648.599976
2 751.5 195.000000 648.400024
3 753.5 200.000000 639.599976
4 762.0 197.500000 645.400024
.. ... ... ...
308 769.0 145.300003 1138.500000
309 775.5 146.500000 1187.000000
310 772.0 149.000000 1217.000000
311 781.0 149.800003 1245.000000
312 785.5 149.600006 1302.000000
[313 rows x 26 columns],
Date 1COV.DE ADS.DE ALV.DE BAS.DE BAYN.DE \
0 2020-01-02 42.180000 291.549988 221.500000 68.290001 73.519997
1 2020-01-03 41.900002 291.950012 219.050003 67.269997 72.580002
2 2020-01-06 39.889999 289.649994 217.699997 66.269997 71.739998
3 2020-01-07 40.130001 294.750000 218.199997 66.300003 72.129997
4 2020-01-08 40.830002 302.850006 218.300003 65.730003 74.000000
.. ... ... ... ... ... ...
314 2021-03-29 56.439999 264.100006 214.600006 70.029999 53.360001
315 2021-03-30 58.200001 265.000000 219.050003 71.879997 53.750000
316 2021-03-31 57.340000 266.200012 217.050003 70.839996 53.959999
317 2021-04-01 57.660000 267.950012 217.649994 71.629997 53.419998
318 2021-04-01 57.660000 267.950012 217.649994 71.629997 53.419998
BEI.DE BMW.DE CON.DE DAI.DE ... IFX.DE LIN.DE \
0 105.650002 74.220001 116.400002 49.974998 ... 20.684999 190.050003
1 105.650002 73.320000 113.980003 49.070000 ... 20.389999 185.300003
2 106.000000 73.050003 112.680000 48.805000 ... 20.045000 183.600006
3 105.750000 74.220001 115.120003 49.195000 ... 21.040001 185.300003
4 106.199997 74.410004 117.339996 49.470001 ... 21.309999 185.850006
.. ... ... ... ... ... ... ...
314 90.220001 85.599998 111.949997 73.709999 ... 34.880001 237.000000
315 90.040001 88.800003 113.449997 75.940002 ... 35.535000 238.500000
316 90.099998 88.470001 112.699997 76.010002 ... 36.154999 238.899994
317 90.500000 89.519997 112.760002 NaN ... 36.570000 238.699997
318 90.500000 89.519997 112.760002 74.970001 ... 36.570000 238.699997
MRK.DE MTX.DE MUV2.DE RWE.DE SAP.DE SIE.DE \
0 106.000000 258.100006 265.899994 26.959999 122.000000 118.639999
1 107.250000 257.799988 262.600006 26.840000 120.459999 116.360001
2 108.400002 258.000000 262.700012 26.450001 119.559998 115.820000
3 109.500000 262.299988 264.500000 27.049999 120.099998 116.559998
4 111.300003 263.000000 265.000000 27.170000 120.820000 117.040001
.. ... ... ... ... ... ...
314 145.949997 196.199997 260.200012 32.709999 104.300003 137.839996
315 145.949997 201.300003 265.000000 32.400002 103.559998 141.080002
316 145.800003 200.699997 262.600006 33.419998 104.419998 140.000000
317 145.800003 206.199997 266.049988 34.060001 106.000000 141.020004
318 145.800003 206.199997 266.049988 34.060001 106.000000 141.020004
VNA.DE VOW3.DE
0 48.419998 180.500000
1 48.599998 176.639999
2 48.450001 176.619995
3 48.709999 176.059998
4 48.970001 176.820007
.. ... ...
314 55.599998 229.750000
315 55.619999 240.550003
316 55.700001 238.600006
317 56.099998 235.850006
318 56.099998 235.850006
[319 rows x 31 columns],
Date ADE.OL AKRBP.OL BAKKA.OL DNB.OL EQNR.OL \
0 2020-01-02 106.800003 289.000000 664.0 165.800003 177.949997
1 2020-01-03 108.199997 292.899994 670.0 164.850006 180.949997
2 2020-01-06 107.000000 296.299988 654.0 164.899994 185.000000
3 2020-01-07 111.199997 295.700012 657.5 163.899994 183.000000
4 2020-01-08 108.800003 295.299988 668.5 166.000000 183.600006
.. ... ... ... ... ... ...
310 2021-03-25 133.000000 237.500000 633.0 178.050003 164.449997
311 2021-03-26 133.300003 244.199997 640.0 181.449997 167.649994
312 2021-03-29 131.100006 248.199997 660.0 182.000000 169.750000
313 2021-03-30 126.900002 244.800003 672.0 182.500000 168.600006
314 2021-03-31 125.900002 242.800003 677.5 182.000000 167.300003
GJF.OL LSG.OL MOWI.OL NAS.OL ... NHY.OL \
0 184.149994 59.240002 229.500000 4094.000000 ... 33.410000
1 185.100006 58.900002 229.800003 3986.000000 ... 32.660000
2 182.550003 59.000000 229.199997 3857.000000 ... 32.299999
3 184.600006 59.000000 227.199997 3964.000000 ... 32.220001
4 184.199997 59.700001 226.699997 3964.000000 ... 32.090000
.. ... ... ... ... ... ...
310 199.199997 70.680000 205.500000 53.299999 ... 50.060001
311 200.000000 71.959999 208.000000 53.020000 ... 53.080002
312 200.600006 73.099998 209.699997 55.000000 ... 53.060001
313 200.399994 73.419998 210.800003 60.759998 ... 53.419998
314 200.600006 73.099998 212.199997 66.400002 ... 54.759998
ORK.OL SALM.OL SCATC.OL SCHA.OL STB.OL SUBC.OL \
0 89.959999 454.000000 123.400002 271.299988 69.900002 105.900002
1 89.699997 453.899994 123.000000 272.100006 69.500000 107.150002
2 89.139999 453.500000 117.300003 268.299988 68.639999 108.150002
3 89.879997 447.700012 116.000000 272.299988 69.720001 107.699997
4 87.720001 451.799988 118.400002 271.899994 70.139999 107.250000
.. ... ... ... ... ... ...
310 84.000000 568.799988 235.000000 368.200012 81.779999 87.800003
311 84.400002 581.799988 237.600006 375.700012 83.860001 87.000000
312 84.839996 585.000000 244.600006 367.399994 84.540001 87.820000
313 84.800003 587.400024 246.399994 361.000000 85.400002 87.279999
314 83.839996 590.000000 258.600006 359.000000 86.139999 85.900002
TEL.OL TOM.OL YAR.OL
0 157.649994 287.799988 361.299988
1 158.800003 284.399994 356.000000
2 159.399994 280.000000 356.000000
3 156.850006 274.000000 351.399994
4 155.449997 278.600006 357.299988
.. ... ... ...
310 149.350006 376.200012 438.000000
311 149.050003 376.700012 444.000000
312 151.000000 378.500000 448.500000
313 150.600006 372.799988 447.200012
314 150.500000 370.299988 444.799988
[315 rows x 21 columns]]
I found out that pd.concat is usually used to solve this, but it does not seem to work for me:
df = pd.concat(dataframes)
df
It seems to return a lot of NaNs, and it should not. How can I solve this? If it helps, all dataframes use the same dates, from 2020-01-02 to 2021-03-31.
Date AMBU-B.CO BAVA.CO CARL-B.CO CHR.CO COLO-B.CO DANSKE.CO DEMANT.CO DSV.CO FLS.CO ... NHY.OL ORK.OL SALM.OL SCATC.OL SCHA.OL STB.OL SUBC.OL TEL.OL TOM.OL YAR.OL
0 2020-01-02 112.500000 172.850006 984.400024 525.599976 814.000000 110.349998 208.600006 769.799988 272.500000 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 2020-01-03 111.300003 171.199997 989.799988 526.799988 812.000000 107.900002 206.600006 751.400024 267.899994 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 2020-01-06 108.150002 166.100006 1001.000000 519.599976 820.200012 106.699997 206.500000 752.400024 265.600006 ... NaN NaN NaN
EDIT: here is how dataframes are created to start with:
def motor_daily(ticker_list):
    # function uses start and end dates to get closing prices for certain stocks
    df = yf.download(ticker_list, start=phase_2.start(),
                     end=phase_2.tomorrow()).Close
    return df

def ticker_data(list):
    # function takes "ticks", which is a list of ticker names, and uses
    # motor_daily to get a data frame from the Yahoo API
    data = []
    for ticks in list:
        data.append(motor_daily(ticks))
    return data

res = ticker_data(list_of_test)
dataframes = [pd.DataFrame(lst) for lst in res]
I fixed it myself, here is what I did:
dataframes_concat = pd.concat(dataframes)
df1 = dataframes_concat.groupby('Date', as_index=True).first()
print(df1)
AMBU-B.CO BAVA.CO CARL-B.CO CHR.CO COLO-B.CO DANSKE.CO DEMANT.CO DSV.CO FLS.CO GMAB.CO ... NHY.OL ORK.OL SALM.OL SCATC.OL SCHA.OL STB.OL SUBC.OL TEL.OL TOM.OL YAR.OL
Date
2020-01-02 112.500000 172.850006 984.400024 525.599976 814.000000 110.349998 208.600006 769.799988 272.500000 1486.5 ... 33.410000 89.959999 454.000000 123.400002 271.299988 69.900002 105.900002 157.649994 287.799988 361.299988
2020-01-03 111.300003 171.199997 989.799988 526.799988 812.000000 107.900002 206.600006 751.400024 267.899994 1444.5 ... 32.660000 89.699997 453.899994 123.000000 272.100006 69.500000 107.150002 158.800003 284.399994 356.000000
2020-01-06 108.150002 166.100006 1001.000000 519.599976 820.200012 106.699997 206.500000 752.400024 265.600006 1419.5 ... 32.299999 89.139999 453.500000 117.300003 268.299988 68.639999 108.150002 159.399994 280.000000 356.000000
2020-01-07 110.500000 170.000000 1002.000000 522.400024 823.599976 107.750000 204.399994 753.799988 273.399994 1456.0 ... 32.220001 89.879997 447.700012 116.000000 272.299988 69.720001 107.699997 156.850006 274.000000 351.399994
2020-01-08 109.599998 171.399994 993.000000 510.399994 820.000000 108.250000 205.600006 755.799988 268.000000 1466.5 ... 32.090000 87.720001 451.799988 118.400002 271.899994 70.139999 107.250000 155.449997 278.600006 357.299988
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2021-03-26 271.299988 302.000000 974.599976 548.599976 954.000000 120.050003 267.600006 1212.500000 237.800003 2045.0 ... 53.080002 84.400002 581.799988 237.600006 375.700012 83.860001 87.000000 149.050003 376.700012 444.000000
2021-03-29 281.000000 294.000000 981.400024 575.000000 968.200012 118.750000 267.100006 1206.000000 238.300003 2028.0 ... 53.060001 84.839996 585.000000 244.600006 367.399994 84.540001 87.820000 151.000000 378.500000 448.500000
2021-03-30 280.899994 282.600006 986.599976 567.400024 950.200012 120.500000 265.500000 1213.500000 243.600006 2019.0 ... 53.419998 84.800003 587.400024 246.399994 361.000000 85.400002 87.279999 150.600006 372.799988 447.200012
2021-03-31 297.899994 286.399994 974.599976 576.400024 953.799988 118.699997 268.700012 1244.500000 243.100006 2087.0 ... 54.759998 83.839996 590.000000 258.600006 359.000000 86.139999 85.900002 150.500000 370.299988 444.799988
2021-04-01 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
The last row is NaN as markets were closed for Easter.
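An alternative sketch that skips the groupby step (assuming each frame still carries its Date column, as in the EDIT above): index every frame by Date and concatenate column-wise, so pandas aligns rows on the dates instead of stacking them.
# hedged alternative: align on Date and join side by side
df1 = pd.concat([d.set_index('Date') for d in dataframes], axis=1)
print(df1)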

Selecting Rows in pandas multiindex using query

I have the following dataframe:
Attributes Adj Close
Symbols ADANIPORTS.NS ASIANPAINT.NS AXISBANK.NS BAJAJ-AUTO.NS BAJFINANCE.NS BAJAJFINSV.NS BHARTIARTL.NS INFRATEL.NS BPCL.NS BRITANNIA.NS ... TCS.NS TATAMOTORS.NS TATASTEEL.NS TECHM.NS TITAN.NS ULTRACEMCO.NS UPL.NS VEDL.NS WIPRO.NS ZEEL.NS
month day
1 1 279.239893 676.232860 290.424052 2324.556588 974.134152 3710.866499 290.157978 243.696764 146.170036 950.108271 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 240.371331 507.737111 236.844831 2340.821987 718.111446 3042.034076 277.125503 236.177303 122.136606 733.759396 ... -2.714824 2.830603 109.334502 -17.856865 13.293902 18.980020 0.689529 -0.006994 -3.862265 -10.423989
3 241.700116 498.997079 213.632179 2368.956136 746.050460 3292.162304 279.075750 231.213816 114.698633 686.986466 ... 0.075497 -0.629591 -0.241416 -0.260787 1.392858 -1.196444 -0.660421 -0.161608 -0.243293 -1.687734
4 223.532480 439.849441 201.245454 2391.910913 499.554044 2313.025635 287.582485 276.568762 104.650728 603.446742 ... -1.270405 0.178012 0.109399 -0.224380 -0.415277 -5.050810 -0.084462 -0.075032 3.924894 0.959136
5 213.588413 359.632790 187.594303 2442.596619 309.180993 1587.324934 260.401816 305.384079 95.571235 475.708696 ... -0.995601 -1.093621 0.214684 -1.189623 -2.503186 -0.511994 -0.512211 0.693024 -1.025715 -1.516946
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
12 27 238.901700 500.376711 227.057510 2413.230611 748.599821 3299.320564 276.806537 242.597250 124.235449 727.263012 ... 2.770155 -4.410527 -0.031403 -5.315438 -1.792164 1.038870 -0.860125 -1.258880 -0.933370 -1.487581
28 236.105050 461.535601 218.893424 2375.671582 542.521903 2613.480190 284.374906 264.309625 117.807956 681.625725 ... -0.614677 -1.045941 0.688749 -0.375988 1.848569 -1.362454 37.301528 4.794349 -21.079648 -2.224608
29 215.606034 372.030459 203.876520 2450.112244 324.772498 1765.010912 257.278008 300.096024 108.679112 543.112336 ... 3.220893 -28.873421 0.197491 0.649738 0.737047 -6.121189 -1.165286 0.197648 0.250269 -0.064486
30 205.715512 432.342895 235.872734 2279.715479 515.535031 2164.257183 237.584375 253.401642 116.322402 634.503822 ... -1.190093 0.111826 -1.100066 -0.274475 -1.107278 -0.638013 -7.148901 -0.594369 -0.622608 0.368726
31 222.971462 490.784491 246.348255 2211.909688 670.891505 2671.694809 260.623987 230.032092 108.617400 719.389436 ... -1.950700 0.994181 -11.328524 -1.575859 -8.297147 1.151578 -0.059656 -0.650074 -0.648105 -0.749307
366 rows × 601 columns
To select the row for month 1 and day 1, I have used the following code:
df.query('month ==1' and 'day ==1')
But this produced the following dataframe:
Attributes Adj Close
Symbols ADANIPORTS.NS ASIANPAINT.NS AXISBANK.NS BAJAJ-AUTO.NS BAJFINANCE.NS BAJAJFINSV.NS BHARTIARTL.NS INFRATEL.NS BPCL.NS BRITANNIA.NS ... TCS.NS TATAMOTORS.NS TATASTEEL.NS TECHM.NS TITAN.NS ULTRACEMCO.NS UPL.NS VEDL.NS WIPRO.NS ZEEL.NS
month day
1 1 279.239893 676.232860 290.424052 2324.556588 974.134152 3710.866499 290.157978 243.696764 146.170036 950.108271 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 1 215.752040 453.336287 213.741552 2373.224390 517.295897 2289.618629 280.212598 253.640594 104.505893 620.435294 ... -2.526060 -1.059128 -2.052233 3.941005 25.233763 -41.377432 1.032536 7.398859 -4.622867 -1.506376
3 1 233.534958 472.889636 204.900776 2318.030298 561.193189 2697.357413 254.006857 250.426263 106.528327 649.475321 ... -2.269081 -1.375370 -1.734496 27.675276 -1.944131 0.401074 -0.852499 -0.119033 -1.723600 -1.930760
4 1 192.280787 467.604906 227.369618 1982.318034 506.188324 1931.920305 252.626459 226.062386 98.663596 637.086713 ... -0.044923 -0.111909 -0.181328 -1.943672 1.983368 -1.677000 -0.531217 0.032385 -0.956535 -2.015332
5 1 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000 -1.000000
6 1 230.836429 509.991614 218.370072 2463.180957 526.564244 2231.603166 289.425584 298.146594 118.566019 754.736115 ... -0.807933 -1.509616 1.792957 10.396550 -1.060003 2.008286 1.029651 6.690478 -3.114476 0.766063
7 1 197.943186 355.930544 242.388461 2168.834937 412.196744 1753.647647 233.189894 241.823186 90.870574 512.000742 ... -1.630295 11.019253 -0.244958 2.188104 -0.505939 -0.564639 -1.747775 -0.394980 -2.736355 -0.140087
8 1 236.361903 491.867703 218.289537 2102.183175 657.764627 2792.688073 264.695685 249.063224 108.213277 662.192035 ... -1.655988 -1.555488 -1.199192 -0.565774 -1.831832 -4.770262 -0.442534 -6.168488 -0.267261 -3.324977
9 1 229.131335 372.101859 225.172708 2322.747894 333.243305 1800.901049 246.923254 287.262203 114.754666 562.854895 ... -2.419973 0.205031 -1.096847 -0.840121 -2.932670 1.719342 6.196965 -2.674245 -6.542936 -2.526353
10 1 208.748352 429.829772 222.081509 2095.421448 553.005620 2204.335371 259.718945 229.177512 102.475334 641.439810 ... 0.752312 -1.371583 -1.367145 -5.607321 3.259092 26.787332 -1.023199 -0.589042 0.507405 2.428903
11 1 248.233805 545.774276 241.743095 2390.945333 803.738236 3088.686081 277.757322 243.703551 131.933623 789.243830 ... -1.882445 -0.660089 -0.476966 -1.097497 -0.525270 -0.857579 -0.702017 0.016806 -0.792296 -0.368364
12 1 200.472858 353.177721 200.870312 2451.274841 295.858735 1556.379498 255.714673 301.000198 103.908244 514.528562 ... -0.789445 -14.382776 0.196276 -0.394203 7.600042 48.345830 -0.276618 -0.411825 2.271997 42.734886
12 rows × 601 columns
It has produced day 1 for each month instead of the single row for month 1 and day 1. What can I do to resolve this issue?
Use a single query string: in Python, 'month ==1' and 'day ==1' is an and of two non-empty strings, which evaluates to just 'day ==1', so only the day condition was applied. Put both conditions in one string:
df.query('month == 1 and day == 1')
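A hedged alternative, assuming month and day are the two levels of the MultiIndex as shown above: .loc can look the row up directly.
df.loc[(1, 1)]  # direct lookup of the (month=1, day=1) row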

How to randomly create a preference dataframe from a dataframe of choices?

I have a DataFrame of votes and I would like to create one of preferences.
For example, here is the number of votes for each party P1, P2, P3 in each city comm1, comm2, ...
Comm Votes P1 P2 P3
0 comm1 1315.0 2.0 424.0 572.0
1 comm2 4682.0 117.0 2053.0 1584.0
2 comm3 2397.0 2.0 40.0 192.0
3 comm4 931.0 2.0 12.0 345.0
4 comm5 842.0 47.0 209.0 76.0
... ... ... ... ... ...
1524 comm1525 10477.0 13.0 673.0 333.0
1525 comm1526 2674.0 1.0 55.0 194.0
1526 comm1527 1691.0 331.0 29.0 78.0
These electoral results would suffice for a first-past-the-post ballot system; I would like to test an alternative-vote election model, so for each political party I need to get the preferences.
As I don't know the preferences, I want to generate them with random numbers. I suppose that voters are honest. For example, for the "P1" party in town "comm1", we know that 2 people voted for it and that there are 1315 voters. I need to create preferences to see whether people would put it as their first, second or third option. That is to say, for each party:
Comm Votes P1_1 P1_2 P1_3 P2_1 P2_2 P2_3 P3_1 P3_2 P3_3
0 comm1 1315.0 2.0 1011.0 303.0 424.0 881.0 10.0 570.0 1.0 1.0
... ... ... ... ... ...
1526 comm1527 1691.0 331.0 1300.0 60.0 299.0 22.0 10.0 ...
So I have to do:
# for each column in parties I create (parties - 1) other columns
# I rename them all Party_i. The former 1 becomes Party_1.
# In the other columns I put a random number.
# For a given line, the sum of all Party_i for i in [1, parties] must be equal to Votes
I tried this so far:
parties = [item for item in df.columns if item not in ['Comm', 'Votes']]
for index, row in df_test.iterrows():
    # In the other columns I put a random number.
    for party in parties:
        # for each column in parties I create (parties - 1) other columns
        for i in range(0, len(parties) - 1):
            print(random.randrange(0, int(row['Votes'])))
            # I rename them all Party_i. The former 1 becomes Party_1.
            row["{party}_{preference}".format(party=party, preference=i)] = (
                random.randrange(0, int(row['Votes'])) if row[party] < row['Votes'] else 0
            )  # wrong: the sum of the votes isn't equal to df['Votes']
The results are:
Comm Votes ... P1_1 P1_2 P1_3 P2_1 P2_2 P2_3 P3_1 P3_2 P3_3
0 comm1 1315.0 ... 1003 460 1588 1284 1482 1613 1429 345
1 comm2 1691.0 ... 1003 460 1588 1284 1482 1613 ...
...
But:
the numbers are the same for each row
the value in the row of Pi_1 isn't equal to the one in the row of Pi (Pi being a given party)
the sum of Pi_j for all j in [1, parties] isn't equal to the number in the Votes column
Update
I tried Antihead's answer with his own data and it worked well. But when applying it to my own data it doesn't; it leaves me with an empty dataframe:
import collections

def fill_cells(cell):
    v_max = cell['Votes']
    all_dict = {}
    # iterate over parties
    for p in parties:
        tmp_l = parties.copy()
        tmp_l.remove(p)
        # sample new data with equal choices
        sampled = np.random.choice(tmp_l, int(v_max - cell[p]))
        # transform into dictionary
        c_sampled = dict(collections.Counter(sampled))
        c_sampled.update({p: cell[p]})
        # batch update of the dictionary keys
        all_dict.update(
            dict(zip([p + '_%s' % k[1] for k in c_sampled.keys()],
                     c_sampled.values()))
        )
    return pd.Series(all_dict)
Indeed, with the following dataframe:
Comm Votes LPC CPC BQ
0 comm1 1315.0 2.0 424.0 572.0
1 comm2 4682.0 117.0 2053.0 1584.0
2 comm3 2397.0 2.0 40.0 192.0
3 comm4 931.0 2.0 12.0 345.0
4 comm5 842.0 47.0 209.0 76.0
... ... ... ... ... ...
1522 comm1523 23808.0 1588.0 4458.0 13147.0
1523 comm1524 639.0 40.0 126.0 40.0
1524 comm1525 10477.0 13.0 673.0 333.0
1525 comm1526 2674.0 1.0 55.0 194.0
1526 comm1527 1691.0 331.0 29.0 78.0
I have an empty dataframe:
0
1
2
3
4
...
1522
1523
1524
1525
1526
Does this work:
import collections
import numpy as np
import pandas as pd

# data
columns = ['Comm', 'Votes', 'P1', 'P2', 'P3']
data = [['comm1', 1315.0, 2.0, 424.0, 572.0],
        ['comm2', 4682.0, 117.0, 2053.0, 1584.0],
        ['comm3', 2397.0, 2.0, 40.0, 192.0],
        ['comm4', 931.0, 2.0, 12.0, 345.0],
        ['comm5', 842.0, 47.0, 209.0, 76.0],
        ['comm1525', 10477.0, 13.0, 673.0, 333.0],
        ['comm1526', 2674.0, 1.0, 55.0, 194.0],
        ['comm1527', 1691.0, 331.0, 29.0, 78.0]]
df = pd.DataFrame(data=data, columns=columns)

def fill_cells(cell):
    v_max = cell['Votes']
    all_dict = {}
    # iterate over parties
    for p in ['P1', 'P2', 'P3']:
        tmp_l = ['P1', 'P2', 'P3']
        tmp_l.remove(p)
        # sample new data with equal choices
        sampled = np.random.choice(tmp_l, int(v_max - cell[p]))
        # transform into dictionary
        c_sampled = dict(collections.Counter(sampled))
        c_sampled.update({p: cell[p]})
        # batch update of the dictionary keys
        all_dict.update(
            dict(zip([p + '_%s' % k[1] for k in c_sampled.keys()],
                     c_sampled.values()))
        )
    return pd.Series(all_dict)

# get back a data frame
df.apply(fill_cells, axis=1)
If you need to merge the data frame back, do something like:
new_df = df.apply(fill_cells, axis=1)
pd.concat([df, new_df], axis=1)
Based on Antihead's answer and for the following dataset:
Comm Votes LPC CPC BQ
0 comm1 1315.0 2.0 424.0 572.0
1 comm2 4682.0 117.0 2053.0 1584.0
2 comm3 2397.0 2.0 40.0 192.0
3 comm4 931.0 2.0 12.0 345.0
4 comm5 842.0 47.0 209.0 76.0
... ... ... ... ... ...
1522 comm1523 23808.0 1588.0 4458.0 13147.0
1523 comm1524 639.0 40.0 126.0 40.0
1524 comm1525 10477.0 13.0 673.0 333.0
1525 comm1526 2674.0 1.0 55.0 194.0
1526 comm1527 1691.0 331.0 29.0 78.0
I tried:
def fill_cells(cell):
    votes_max = cell['Votes']
    all_dict = {}
    # iterate over parties
    parties_temp = parties.copy()
    for p in parties_temp:
        preferences = ['1', '2', '3']
        for preference in preferences:
            preferences.remove(preference)
            # sample new data with equal choices
            sampled = np.random.choice(preferences, int(votes_max - cell[p]))
            # transform into dictionary
            c_sampled = dict(collections.Counter(sampled))
            c_sampled.update({p: cell[p]})
            c_sampled['1'] = c_sampled.pop(p)
            # batch update of the dictionary keys
            all_dict.update(
                dict(zip([p + '_%s' % k for k in c_sampled.keys()],
                         c_sampled.values()))
            )
    return pd.Series(all_dict)
It returns
LPC_2 LPC_3 LPC_1 CPC_2 CPC_3 CPC_1 BQ_2 BQ_3 BQ_1
0 891.0 487.0 424.0 743.0 373.0 572.0 1313.0 683.0 2.0
1 2629.0 1342.0 2053.0 3098.0 1603.0 1584.0 4565.0 2301.0 117.0
2 2357.0 1186.0 40.0 2205.0 1047.0 192.0 2395.0 1171.0 2.0
3 919.0 451.0 12.0 586.0 288.0 345.0 929.0 455.0 2.0
4 633.0 309.0 209.0 766.0 399.0 76.0 795.0 396.0 47.0
... ... ... ... ... ... ... ... ... ...
1520 1088.0 536.0 42.0 970.0 462.0 160.0 1117.0 540.0 13.0
1521 4742.0 2341.0 219.0 3655.0 1865.0 1306.0 4705.0 2375.0 256.0
1522 19350.0 9733.0 4458.0 10661.0 5352.0 13147.0 22220.0 11100.0 1588.0
1523 513.0 264.0 126.0 599.0 267.0 40.0 599.0 306.0 40.0
1524 9804.0 4885.0 673.0 10144.0 5012.0 333.0 10464.0 5162.0 13.0
It's almost good. I would have preferred the preferences to be dynamically encoded rather than hard-coding ['1','2','3'].
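A minimal sketch of one way to do that (my own assumption about the intended model, not code from the thread): derive the preference ranks from the party list, and split each party's remaining ballots with np.random.multinomial so that every party's preference columns sum exactly to Votes.
import numpy as np
import pandas as pd

def preference_counts(row, parties):
    # rank 1 gets the party's own first-preference votes; the remaining
    # Votes - row[p] ballots are split uniformly over the other ranks,
    # so each party's preference columns sum exactly to Votes
    out = {}
    n = len(parties)
    for p in parties:
        rest = int(row['Votes'] - row[p])
        split = np.random.multinomial(rest, [1 / (n - 1)] * (n - 1))
        out['%s_1' % p] = row[p]
        for rank, count in zip(range(2, n + 1), split):
            out['%s_%d' % (p, rank)] = count
    return pd.Series(out)

parties = [c for c in df.columns if c not in ('Comm', 'Votes')]
preferences = df.apply(preference_counts, axis=1, parties=parties)
result = pd.concat([df[['Comm', 'Votes']], preferences], axis=1)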

Python - faster way to run a for loop in a dataframe

I am running the following code to calculate, for every dataframe row, the number of positive days among the previous rows and the number of days on which the stock has beaten the S&P 500 index:
for offset in [1, 5, 15, 30, 45, 60, 75, 90, 120, 150,
               200, 250, 500, 750, 1000, 1250, 1500]:
    asset['return_stock'] = (asset.Close - asset.Close.shift(1)) / asset.Close.shift(1)
    merged_data = pd.merge(asset, sp_500, on='Date')
    total_positive_days = 0
    total_beating_sp_days = 0
    for index, row in merged_data.iterrows():
        print(offset, index)
        for i in range(0, offset):
            if index - i - 1 > 0:
                if merged_data.loc[index - i, 'Close_x'] > merged_data.loc[index - i - 1, 'Close_x']:
                    total_positive_days += 1
                if merged_data.loc[index - i, 'return_stock'] > merged_data.loc[index - i - 1, 'return_sp']:
                    total_beating_sp_days += 1
but it is quite slow. Is there a way to speed it up, possibly by getting rid of the nested for loops?
My dataset looks like this (merged_data follows):
Date Open_x High_x Low_x Close_x Adj Close_x Volume_x return_stock Pct_positive_1 Pct_beating_1 Pct_change_1 Pct_change_plus_1 Pct_positive_5 Pct_beating_5 Pct_change_5 Pct_change_plus_5 Pct_positive_15 Pct_beating_15 Pct_change_15 Pct_change_plus_15 Pct_positive_30 Pct_beating_30 Pct_change_30 Pct_change_plus_30 Open_y High_y Low_y Close_y Adj Close_y Volume_y return_sp
0 2010-01-04 30.490000 30.642857 30.340000 30.572857 26.601469 123432400 NaN 1311.0 1261.0 NaN -0.001726 1310.4 1260.8 NaN 0.018562 1307.2 1257.6 NaN 0.039186 1302.066667 1252.633333 NaN 0.056579 1116.560059 1133.869995 1116.560059 1132.989990 1132.989990 3991400000 0.016043
1 2010-01-05 30.657143 30.798571 30.464285 30.625713 26.647457 150476200 0.001729 1311.0 1261.0 0.001729 0.016163 1310.4 1260.8 NaN 0.032062 1307.2 1257.6 NaN 0.031268 1302.066667 1252.633333 NaN 0.056423 1132.660034 1136.630005 1129.660034 1136.520020 1136.520020 2491020000 0.003116
2 2010-01-06 30.625713 30.747143 30.107143 30.138571 26.223597 138040000 -0.015906 1311.0 1261.0 -0.015906 0.001852 1310.4 1260.8 NaN 0.001519 1307.2 1257.6 NaN 0.058608 1302.066667 1252.633333 NaN 0.046115 1135.709961 1139.189941 1133.949951 1137.140015 1137.140015 4972660000 0.000546
3 2010-01-07 30.250000 30.285715 29.864286 30.082857 26.175119 119282800 -0.001849 1311.0 1261.0 -0.001849 -0.006604 1310.4 1260.8 NaN 0.005491 1307.2 1257.6 NaN 0.096428 1302.066667 1252.633333 NaN 0.050694 1136.270020 1142.459961 1131.319946 1141.689941 1141.689941 5270680000 0.004001
4 2010-01-08 30.042856 30.285715 29.865715 30.282858 26.349140 111902700 0.006648 1311.0 1261.0 0.006648 0.008900 1310.4 1260.8 NaN 0.029379 1307.2 1257.6 NaN 0.088584 1302.066667 1252.633333 NaN 0.075713 1140.520020 1145.390015 1136.219971 1144.979980 1144.979980 4389590000 0.002882
asset follows:
Date Open High Low Close Adj Close Volume return_stock Pct_positive_1 Pct_beating_1 Pct_change_1 Pct_change_plus_1 Pct_positive_5 Pct_beating_5 Pct_change_5 Pct_change_plus_5
0 2010-01-04 30.490000 30.642857 30.340000 30.572857 26.601469 123432400 NaN 1311.0 1261.0 NaN -0.001726 1310.4 1260.8 NaN 0.018562
1 2010-01-05 30.657143 30.798571 30.464285 30.625713 26.647457 150476200 0.001729 1311.0 1261.0 0.001729 0.016163 1310.4 1260.8 NaN 0.032062
2 2010-01-06 30.625713 30.747143 30.107143 30.138571 26.223597 138040000 -0.015906 1311.0 1261.0 -0.015906 0.001852 1310.4 1260.8 NaN 0.001519
3 2010-01-07 30.250000 30.285715 29.864286 30.082857 26.175119 119282800 -0.001849 1311.0 1261.0 -0.001849 -0.006604 1310.4 1260.8 NaN 0.005491
4 2010-01-08 30.042856 30.285715 29.865715 30.282858 26.349140 111902700 0.006648 1311.0 1261.0 0.006648 0.008900 1310.4 1260.8 NaN 0.029379
sp_500 follows:
Date Open High Low Close Adj Close Volume return_sp
0 1999-12-31 1464.469971 1472.420044 1458.189941 1469.250000 1469.250000 374050000 NaN
1 2000-01-03 1469.250000 1478.000000 1438.359985 1455.219971 1455.219971 931800000 -0.009549
2 2000-01-04 1455.219971 1455.219971 1397.430054 1399.420044 1399.420044 1009000000 -0.038345
3 2000-01-05 1399.420044 1413.270020 1377.680054 1402.109985 1402.109985 1085500000 0.001922
4 2000-01-06 1402.109985 1411.900024 1392.099976 1403.449951 1403.449951 1092300000 0.000956
This is a partial answer.
I think the way you do
asset.Close - asset.Close.shift(1)
at the top is key to how you might do this. Instead of
if merged_data.loc[index-i,'Close_x'] > merged_data.loc[index-i-1,'Close_x']
create a column with the change in Close_x:
merged_data['Delta_Close_x'] = merged_data.Close_x - merged_data.Close_x.shift(1)
Similarly,
if merged_data.loc[index-i,'return_stock'] > merged_data.loc[index-i-1,'return_sp']
becomes
merged_data['vs_sp'] = merged_data.return_stock - merged_data.return_sp.shift(1)
Then you can iterate over the offsets and use subsets like
merged_data[(merged_data['Delta_Close_x'] > 0) & (merged_data['vs_sp'] > 0)]
(note that boolean masks need & and parentheses, not the Python and keyword).
There are a lot of additional details to work out, but I hope this gets you started.
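As a hedged sketch of where this leads (column names taken from the question; the rolling-window formulation is my assumption, not the answerer's code): precompute the per-day comparison flags once, then a trailing rolling sum gives both counts for every row and every offset with no Python-level loops.
# per-day flags, computed once
merged_data['positive_day'] = (merged_data['Close_x'].diff() > 0).astype(int)
merged_data['beat_sp_day'] = (
    merged_data['return_stock'] > merged_data['return_sp'].shift(1)
).astype(int)

# trailing counts for each window length, no iterrows needed
for offset in [1, 5, 15, 30, 45, 60, 75, 90, 120, 150,
               200, 250, 500, 750, 1000, 1250, 1500]:
    merged_data['positive_%d' % offset] = (
        merged_data['positive_day'].rolling(offset, min_periods=1).sum()
    )
    merged_data['beating_%d' % offset] = (
        merged_data['beat_sp_day'].rolling(offset, min_periods=1).sum()
    )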
