if statement in for loop with pandas dataframes - python

I am writing a Dollar Cost Average script where I want to choose between two equations. I made an Excel spreadsheet that I'm trying to port over to Python, and I've gotten pretty far except for the last step, which has had me searching for a solution for three weeks now. The errors happen when I loop through a DataFrame with a for loop: I would like to check a column with an if statement, and if it's true run one equation, otherwise run the other. I can get the for loop to work and I can get the if statements to work, but not combined. See the commented-out code for everything that has been tried. I have also tried np.where instead of the if statements, as well as .loc, lambda, and a list comprehension. Nothing is working, please help. FYI, the code in question builds the ['trend bal'] column. ***See the end for the correct code.
What the df looks like:
Index timestamp Open High Low ... rate account bal invested ST_10_1.0 if trend
0 0 8/16/2021 4382.439941 4444.350098 4367.729980 ... 1.000000 $10,000.00 10000 1 0
1 1 8/23/2021 4450.290039 4513.330078 4450.290039 ... 0.015242 $10,252.42 10100 1 0
2 2 8/30/2021 4513.759766 4545.850098 4513.759766 ... 0.005779 $10,411.67 10200 1 0
3 3 9/6/2021 4535.379883 4535.379883 4457.660156 ... -0.016944 $10,335.25 10300 1 0
4 4 9/13/2021 4474.810059 4492.990234 4427.759766 ... -0.005739 $10,375.93 10400 1 0
5 5 9/20/2021 4402.950195 4465.399902 4305.910156 ... 0.005073 $10,528.57 10500 1 0
6 6 9/27/2021 4442.120117 4457.299805 4288.520020 ... -0.022094 $10,395.95 10600 1 0
7 7 10/4/2021 4348.839844 4429.970215 4278.939941 ... 0.007872 $10,577.79 10700 1 0
8 8 10/11/2021 4385.439941 4475.819824 4329.919922 ... 0.018225 $10,870.57 10800 1 0
9 9 10/18/2021 4463.720215 4559.669922 4447.470215 ... 0.016445 $11,149.33 10900 1 0
10 10 10/25/2021 4553.689941 4608.080078 4537.359863 ... 0.013307 $11,397.70 11000 1 0
11 11 11/1/2021 4610.620117 4718.500000 4595.060059 ... 0.020009 $11,725.75 11100 1 0
12 12 11/8/2021 4701.479980 4714.919922 4630.859863 ... -0.003125 $11,789.11 11200 1 0
13 13 11/15/2021 4689.299805 4717.750000 4672.779785 ... 0.003227 $11,927.15 11300 1 0
14 14 11/22/2021 4712.000000 4743.830078 4585.430176 ... -0.021997 $11,764.79 11400 1 0
15 15 11/29/2021 4628.750000 4672.950195 4495.120117 ... -0.012230 $11,720.92 11500 -1 100
16 16 12/6/2021 4548.370117 4713.569824 4540.509766 ... 0.038249 $12,269.23 11600 -1 100
17 17 12/13/2021 4710.299805 4731.990234 4600.220215 ... -0.019393 $12,131.29 11700 1 0
18 18 12/20/2021 4587.899902 4740.740234 4531.100098 ... 0.022757 $12,507.36 11800 1 0
19 19 12/27/2021 4733.990234 4808.930176 4733.990234 ... 0.008547 $12,714.25 11900 1 0
20 20 1/3/2022 4778.140137 4818.620117 4662.740234 ... -0.018705 $12,576.44 12000 1 0
21 21 1/10/2022 4655.339844 4748.830078 4582.240234 ... -0.003032 $12,638.31 12100 1 0
22 22 1/17/2022 4632.240234 4632.240234 4395.339844 ... -0.056813 $12,020.29 12200 1 0
23 23 1/24/2022 4356.319824 4453.229980 4222.620117 ... 0.007710 $12,212.97 12300 -1 100
24 24 1/31/2022 4431.790039 4595.310059 4414.020020 ... 0.015497 $12,502.23 12400 -1 100
25 25 2/7/2022 4505.750000 4590.029785 4401.410156 ... -0.018196 $12,374.75 12500 1 0
26 26 2/14/2022 4412.609863 4489.549805 4327.220215 ... -0.015790 $12,279.35 12600 1 0
27 27 2/21/2022 4332.740234 4385.339844 4114.649902 ... 0.008227 $12,480.38 12700 1 0
28 28 2/28/2022 4354.169922 4416.779785 4279.540039 ... -0.012722 $12,421.61 12800 1 0
29 29 3/7/2022 4327.009766 4327.009766 4157.870117 ... -0.028774 $12,164.19 12900 -1 100
30 30 3/14/2022 4202.750000 4465.399902 4161.720215 ... 0.061558 $13,012.99 13000 -1 100
31 31 3/21/2022 4462.399902 4546.029785 4424.299805 ... 0.017911 $13,346.07 13100 1 0
32 32 3/28/2022 4541.089844 4637.299805 4507.569824 ... 0.000616 $13,454.30 13200 1 0
33 33 4/4/2022 4547.970215 4593.450195 4450.040039 ... -0.012666 $13,383.88 13300 1 0
34 34 4/11/2022 4462.640137 4471.000000 4381.339844 ... -0.021320 $13,198.53 13400 1 0
35 35 4/18/2022 4385.629883 4512.939941 4267.620117 ... -0.027503 $12,935.53 13500 -1 100
36 36 4/25/2022 4255.339844 4308.450195 4124.279785 ... -0.032738 $12,612.05 13600 -1 100
37 37 5/2/2022 4130.609863 4307.660156 4062.510010 ... -0.002079 $12,685.83 13700 -1 100
38 38 5/9/2022 4081.270020 4081.270020 3858.870117 ... -0.024119 $12,479.86 13800 -1 100
39 39 5/16/2022 4013.020020 4090.719971 3810.320068 ... -0.030451 $12,199.84 13900 -1 100
40 40 5/23/2022 3919.419922 4158.490234 3875.129883 ... 0.065844 $13,103.12 14000 -1 100
41 41 5/30/2022 4151.089844 4177.509766 4073.850098 ... -0.011952 $13,046.51 14100 1 0
42 42 6/6/2022 4134.720215 4168.779785 3900.159912 ... -0.050548 $12,487.03 14200 1 0
43 43 6/13/2022 3838.149902 3838.149902 3636.870117 ... -0.057941 $11,863.52 14300 -1 100
44 44 6/20/2022 3715.310059 3913.649902 3715.310059 ... 0.064465 $12,728.31 14400 -1 100
45 45 6/27/2022 3920.760010 3945.860107 3738.669922 ... -0.022090 $12,547.14 14500 -1 100
46 46 7/4/2022 3792.610107 3918.500000 3742.060059 ... 0.019358 $12,890.03 14600 -1 100
47 47 7/11/2022 3880.939941 3880.939941 3721.560059 ... -0.009289 $12,870.29 14700 -1 100
48 48 7/18/2022 3883.790039 4012.439941 3818.629883 ... 0.025489 $13,298.35 14800 -1 100
49 49 7/25/2022 3965.719971 4140.149902 3910.739990 ... 0.042573 $13,964.51 14900 1 0
50 50 8/1/2022 4112.379883 4167.660156 4079.810059 ... 0.003607 $14,114.88 15000 1 0
51 51 8/8/2022 4155.930176 4280.470215 4112.089844 ... 0.032558 $14,674.44 15100 1 0
52 52 8/15/2022 4269.370117 4325.279785 4253.080078 ... 0.000839 $14,786.75 15200 1 0
53 53 8/19/2022 4266.310059 4266.310059 4218.700195 ... -0.012900 $14,696.00 15300 1 0
What it should look like:
Index timestamp Open High Low ... account bal invested ST_10_1.0 if trend trend bal
0 0 8/16/2021 4382.439941 4444.350098 4367.729980 ... $10,000.00 10000 1 0 $10,000.00
1 1 8/23/2021 4450.290039 4513.330078 4450.290039 ... $10,252.42 10100 1 0 $10,252.42
2 2 8/30/2021 4513.759766 4545.850098 4513.759766 ... $10,411.67 10200 1 0 $10,411.67
3 3 9/6/2021 4535.379883 4535.379883 4457.660156 ... $10,335.25 10300 1 0 $10,335.25
4 4 9/13/2021 4474.810059 4492.990234 4427.759766 ... $10,375.93 10400 1 0 $10,375.93
5 5 9/20/2021 4402.950195 4465.399902 4305.910156 ... $10,528.57 10500 1 0 $10,528.57
6 6 9/27/2021 4442.120117 4457.299805 4288.520020 ... $10,395.95 10600 1 0 $10,395.95
7 7 10/4/2021 4348.839844 4429.970215 4278.939941 ... $10,577.79 10700 1 0 $10,577.79
8 8 10/11/2021 4385.439941 4475.819824 4329.919922 ... $10,870.57 10800 1 0 $10,870.57
9 9 10/18/2021 4463.720215 4559.669922 4447.470215 ... $11,149.33 10900 1 0 $11,149.33
10 10 10/25/2021 4553.689941 4608.080078 4537.359863 ... $11,397.70 11000 1 0 $11,397.70
11 11 11/1/2021 4610.620117 4718.500000 4595.060059 ... $11,725.75 11100 1 0 $11,725.75
12 12 11/8/2021 4701.479980 4714.919922 4630.859863 ... $11,789.11 11200 1 0 $11,789.11
13 13 11/15/2021 4689.299805 4717.750000 4672.779785 ... $11,927.15 11300 1 0 $11,927.15
14 14 11/22/2021 4712.000000 4743.830078 4585.430176 ... $11,764.79 11400 1 0 $11,764.79
15 15 11/29/2021 4628.750000 4672.950195 4495.120117 ... $11,720.92 11500 -1 100 $11,720.92
16 16 12/6/2021 4548.370117 4713.569824 4540.509766 ... $12,269.23 11600 -1 100 $11,820.92
17 17 12/13/2021 4710.299805 4731.990234 4600.220215 ... $12,131.29 11700 1 0 $11,920.92
18 18 12/20/2021 4587.899902 4740.740234 4531.100098 ... $12,507.36 11800 1 0 $12,292.19
19 19 12/27/2021 4733.990234 4808.930176 4733.990234 ... $12,714.25 11900 1 0 $12,497.25
20 20 1/3/2022 4778.140137 4818.620117 4662.740234 ... $12,576.44 12000 1 0 $12,363.49
21 21 1/10/2022 4655.339844 4748.830078 4582.240234 ... $12,638.31 12100 1 0 $12,426.01
22 22 1/17/2022 4632.240234 4632.240234 4395.339844 ... $12,020.29 12200 1 0 $11,820.05
23 23 1/24/2022 4356.319824 4453.229980 4222.620117 ... $12,212.97 12300 -1 100 $12,011.19
24 24 1/31/2022 4431.790039 4595.310059 4414.020020 ... $12,502.23 12400 -1 100 $12,111.19
25 25 2/7/2022 4505.750000 4590.029785 4401.410156 ... $12,374.75 12500 1 0 $12,211.19
26 26 2/14/2022 4412.609863 4489.549805 4327.220215 ... $12,279.35 12600 1 0 $12,118.38
27 27 2/21/2022 4332.740234 4385.339844 4114.649902 ... $12,480.38 12700 1 0 $12,318.08
28 28 2/28/2022 4354.169922 4416.779785 4279.540039 ... $12,421.61 12800 1 0 $12,261.37
29 29 3/7/2022 4327.009766 4327.009766 4157.870117 ... $12,164.19 12900 -1 100 $12,008.56
30 30 3/14/2022 4202.750000 4465.399902 4161.720215 ... $13,012.99 13000 -1 100 $12,108.56
31 31 3/21/2022 4462.399902 4546.029785 4424.299805 ... $13,346.07 13100 1 0 $12,208.56
32 32 3/28/2022 4541.089844 4637.299805 4507.569824 ... $13,454.30 13200 1 0 $12,316.09
33 33 4/4/2022 4547.970215 4593.450195 4450.040039 ... $13,383.88 13300 1 0 $12,260.08
34 34 4/11/2022 4462.640137 4471.000000 4381.339844 ... $13,198.53 13400 1 0 $12,098.70
35 35 4/18/2022 4385.629883 4512.939941 4267.620117 ... $12,935.53 13500 -1 100 $11,865.95
36 36 4/25/2022 4255.339844 4308.450195 4124.279785 ... $12,612.05 13600 -1 100 $11,965.95
37 37 5/2/2022 4130.609863 4307.660156 4062.510010 ... $12,685.83 13700 -1 100 $12,065.95
38 38 5/9/2022 4081.270020 4081.270020 3858.870117 ... $12,479.86 13800 -1 100 $12,165.95
39 39 5/16/2022 4013.020020 4090.719971 3810.320068 ... $12,199.84 13900 -1 100 $12,265.95
40 40 5/23/2022 3919.419922 4158.490234 3875.129883 ... $13,103.12 14000 -1 100 $12,365.95
41 41 5/30/2022 4151.089844 4177.509766 4073.850098 ... $13,046.51 14100 1 0 $12,465.95
42 42 6/6/2022 4134.720215 4168.779785 3900.159912 ... $12,487.03 14200 1 0 $11,935.81
43 43 6/13/2022 3838.149902 3838.149902 3636.870117 ... $11,863.52 14300 -1 100 $11,344.24
44 44 6/20/2022 3715.310059 3913.649902 3715.310059 ... $12,728.31 14400 -1 100 $11,444.24
45 45 6/27/2022 3920.760010 3945.860107 3738.669922 ... $12,547.14 14500 -1 100 $11,544.24
46 46 7/4/2022 3792.610107 3918.500000 3742.060059 ... $12,890.03 14600 -1 100 $11,644.24
47 47 7/11/2022 3880.939941 3880.939941 3721.560059 ... $12,870.29 14700 -1 100 $11,744.24
48 48 7/18/2022 3883.790039 4012.439941 3818.629883 ... $13,298.35 14800 -1 100 $11,844.24
49 49 7/25/2022 3965.719971 4140.149902 3910.739990 ... $13,964.51 14900 1 0 $11,944.24
50 50 8/1/2022 4112.379883 4167.660156 4079.810059 ... $14,114.88 15000 1 0 $12,087.33
51 51 8/8/2022 4155.930176 4280.470215 4112.089844 ... $14,674.44 15100 1 0 $12,580.87
52 52 8/15/2022 4269.370117 4325.279785 4253.080078 ... $14,786.75 15200 1 0 $12,691.42
53 53 8/19/2022 4266.310059 4266.310059 4218.700195 ... $14,696.00 15300 1 0 $12,627.70
Python Code:
from ctypes.wintypes import VARIANT_BOOL
from xml.dom.expatbuilder import FilterVisibilityController
import ccxt
from matplotlib import pyplot as plt
import config
import schedule
import pandas as pd
import pandas_ta as ta
pd.set_option('display.max_rows', None)
# pd.set_option('display.max_columns', None)
import warnings
warnings.filterwarnings('ignore')
import numpy as np
from datetime import datetime
import time
import yfinance as yf

ticker = yf.Ticker('^GSPC')
df = ticker.history(period="1y", interval="1wk")
df.reset_index(inplace=True)
df.rename(columns={'Date': 'timestamp'}, inplace=True)
# df.drop(columns={'Open', 'High', 'Low', 'Volume'}, inplace=True, axis=1)
df.drop(columns={'Dividends', 'Stock Splits'}, inplace=True, axis=1)
# df['Close'].ffill(axis=0, inplace=True)

invest = 10000
weekly = 100
fee = .15/100
fees = 1 - fee

df.loc[df.index == 0, 'rate'] = 1
df.loc[df.index > 0, 'rate'] = (df['Close'] / df['Close'].shift(1)) - 1
df.loc[df.index == 0, 'account bal'] = invest
for i in range(1, len(df)):
    df.loc[i, 'account bal'] = (df.loc[i-1, 'account bal'] * (1 + df.loc[i, 'rate'])) + weekly
df['invested'] = (df.index * weekly) + invest

# Supertrend
ATR = 10
Mult = 1.0
ST = ta.supertrend(df['High'], df['Low'], df['Close'], ATR, Mult)
df[f'ST_{ATR}_{Mult}'] = ST[f'SUPERTd_{ATR}_{Mult}']
df[f'ST_{ATR}_{Mult}'] = df[f'ST_{ATR}_{Mult}'].shift(1).fillna(1)
df.loc[df[f'ST_{ATR}_{Mult}'] == 1, 'if trend'] = 0
df.loc[df[f'ST_{ATR}_{Mult}'] == -1, 'if trend'] = weekly

# df.loc[df.index == 0, 'trend bal'] = invest
# for i in range(1, len(df)):
#     np.where(df.loc[df[f'ST_{ATR}_{Mult}'] == 1, 'trend bal'], (df.loc[i-1, 'trend bal'] * (1 + df.loc[i, 'rate'])) + weekly, df.loc[i-i, 'trend bal'] + df['if trend'])

# df.loc[df.index == 0, 'trend bal'] = invest
# for i in range(1, len(df)):
#     if df[f'ST_{ATR}_{Mult}'] == 1:
#         df.loc[i, 'trend bal'] = (df.loc[i-1, 'trend bal'] * (1 + df.loc[i, 'rate'])) + weekly
#     else:
#         df.loc[i, 'trend bal'] = df.loc[i-i, 'trend bal'] + df['if trend']

# for i in range(1, len(df)):
#     df.loc[df[f'ST_{ATR}_{Mult}'].shift(1) == 1, 'trend bal'] = (df.loc[i-1, 'trend bal'] * (1 + df.loc[i, 'rate'])) + weekly
#     df.loc[df[f'ST_{ATR}_{Mult}'].shift(1) == -1, 'trend bal'] = df.loc[i-i, 'trend bal'] + df['if trend']

# df.to_csv('GSPC.csv', index=False, mode='a')
# plt.plot(df['timestamp'], df['account bal'])
# plt.plot(df['timestamp'], df['invested'])
# plt.plot(df['timestamp'], df['close'])
# plt.show()
print(df)
What some of the errors look like:
np.where(df.loc[df[f'ST_{ATR}_{Mult}'] == 1, 'trend bal'], (df.loc[i-1, 'trend bal'] * (1 + df.loc[i, 'rate'])) + weekly, df.loc[i-i, 'trend bal'] + df['if trend'])
File "<__array_function__ internals>", line 180, in where
ValueError: operands could not be broadcast together with shapes (36,) () (54,)
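For context, this ValueError is np.where refusing arguments that cannot be broadcast to a common shape: here a 36-element filtered Series as the condition, a scalar, and the full 54-row result of the Series addition. A minimal sketch of the same shape rule on toy arrays (the array contents are made up for illustration):

```python
import numpy as np

cond = np.array([True, False, True])   # shape (3,)
a = np.array([1.0, 2.0, 3.0])          # shape (3,) - compatible with cond
b = np.array([10.0, 20.0])             # shape (2,) - NOT broadcastable against (3,)

# A scalar broadcasts against any shape, so this works:
print(np.where(cond, a, 0))            # [1. 0. 3.]

# Mismatched lengths do not, reproducing the error above:
try:
    np.where(cond, a, b)
except ValueError as e:
    print(e)                           # operands could not be broadcast together ...
```

In the failing line, the condition, the "if true" branch, and the "if false" branch were built from differently filtered slices of the DataFrame, so their lengths disagreed in exactly this way.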
Another error:
line 1535, in __nonzero__
raise ValueError(
ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().
No error but not the correct amounts:
df['trend bal'] = 0
for i in range(1, len(df)):
    df.loc[df[f'ST_{ATR}_{Mult}'].shift(1) == 1, 'trend bal'] = (df.loc[i-1, 'trend bal'] * (1 + df.loc[i, 'rate'])) + weekly
    df.loc[df[f'ST_{ATR}_{Mult}'].shift(1) == -1, 'trend bal'] = df.loc[i-i, 'trend bal'] + df['if trend']
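For reference, all three failed attempts compare a whole column (a Series) where a single value is needed; doing the comparison on one cell per loop iteration avoids both errors. A minimal sketch of that pattern on a toy DataFrame (the column names and numbers are illustrative, not the real data):

```python
import pandas as pd

df = pd.DataFrame({'signal': [1, 1, -1, 1],
                   'rate': [0.0, 0.01, -0.02, 0.03]})
weekly = 100
df.loc[0, 'bal'] = 1000.0

for i in range(1, len(df)):
    # df.loc[i, 'signal'] is a scalar, so a plain `if` works here
    if df.loc[i, 'signal'] == 1:
        df.loc[i, 'bal'] = df.loc[i - 1, 'bal'] * (1 + df.loc[i, 'rate']) + weekly
    else:
        df.loc[i, 'bal'] = df.loc[i - 1, 'bal'] + weekly

print(df['bal'].tolist())
```

The mask-based loop above instead writes the same scalar into every masked row on every iteration, which is why it runs without an error but produces the wrong amounts.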
See screenshot of the Excel formula: [excel spreadsheet]
*** This makes the correct calculations, thanks to Ingwersen_erik:
import pandas as pd
import pandas_ta as ta
import numpy as np
pd.set_option('display.max_rows', None)

df = pd.read_csv('etcusd.csv')
invest = 10000
weekly = 100
fee = .15/100
fees = 1 - fee

df.loc[df.index == 0, 'rate'] = 1
df.loc[df.index > 0, 'rate'] = (df['Close'] / df['Close'].shift(1)) - 1
df.loc[df.index == 0, 'account bal'] = invest
for i in range(1, len(df)):
    df.loc[i, 'account bal'] = (df.loc[i-1, 'account bal'] * (1 + df.loc[i, 'rate'])) + weekly
df['invested'] = (df.index * weekly) + invest
MDD = ((df['account bal'] - df['account bal'].max()) / df['account bal'].max()).min()

# Supertrend
ATR = 10
Mult = 1.0
ST = ta.supertrend(df['High'], df['Low'], df['Close'], ATR, Mult)
df[f'ST_{ATR}_{Mult}'] = ST[f'SUPERTd_{ATR}_{Mult}']
df[f'ST_{ATR}_{Mult}'] = df[f'ST_{ATR}_{Mult}'].shift(1).fillna(1)

df.loc[df.index == 0, "trend bal"] = invest
for index, row in df.iloc[1:].iterrows():
    row['trend bal'] = np.where(
        df.loc[index - 1, f'ST_{ATR}_{Mult}'] == 1,
        (df.loc[index - 1, 'trend bal'] * (1 + row['rate'])) + weekly,
        df.loc[index - 1, 'trend bal'] + weekly,
    )
    df.loc[df.index == index, 'trend bal'] = row['trend bal']
print(df)
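A side note on the fix above: because each row's 'trend bal' depends on the previous row's value, the recurrence has to stay a loop, and np.where is only being fed scalars there, so a plain Python conditional expression is equivalent and arguably clearer. A sketch of that variant on toy data (the values are illustrative stand-ins for the real columns):

```python
import pandas as pd

# toy stand-ins for the real columns
df = pd.DataFrame({'ST_10_1.0': [1, 1, -1, 1],
                   'rate': [0.0, 0.02, -0.01, 0.03]})
invest, weekly = 10000, 100
df.loc[0, 'trend bal'] = invest

for i in range(1, len(df)):
    prev = df.loc[i - 1, 'trend bal']
    # scalar ternary instead of np.where on scalars
    df.loc[i, 'trend bal'] = (
        prev * (1 + df.loc[i, 'rate']) + weekly
        if df.loc[i - 1, 'ST_10_1.0'] == 1
        else prev + weekly
    )

print(df['trend bal'].tolist())
```

This avoids the per-iteration numpy call and the `row[...] = ...` write to an iterrows copy, which only works because the result is assigned back with `.loc` afterwards.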

Does this solve your problem?
import time
import ccxt
import warnings
import pandas as pd
import pandas_ta as ta
import yfinance as yf
import numpy as np
import matplotlib.pyplot as plt
from datetime import datetime
from ctypes.wintypes import VARIANT_BOOL
from xml.dom.expatbuilder import FilterVisibilityController

warnings.filterwarnings("ignore")
pd.set_option("display.max_rows", None)
pd.set_option("display.max_columns", None)

invest = 10_000
weekly = 100
fee = 0.15 / 100
fees = 1 - fee
ATR = 10
Mult = 1.0

ticker = yf.Ticker("^GSPC")
df = (
    ticker.history(period="1y", interval="1wk")
    .reset_index()
    .rename(columns={"Date": "timestamp"})
    .drop(columns={"Dividends", "Stock Splits"}, errors="ignore")
)

df.loc[df.index == 0, "rate"] = 1
df.loc[df.index > 0, "rate"] = (df["Close"] / df["Close"].shift(1)) - 1
df.loc[df.index == 0, "account bal"] = invest
for i in range(1, len(df)):
    df.loc[i, "account bal"] = (
        df.loc[i - 1, "account bal"] * (1 + df.loc[i, "rate"])
    ) + weekly
df["invested"] = (df.index * weekly) + invest

# Super-trend
ST = ta.supertrend(df["High"], df["Low"], df["Close"], ATR, Mult)
df[f"ST_{ATR}_{Mult}"] = ST[f"SUPERTd_{ATR}_{Mult}"]
df[f"ST_{ATR}_{Mult}"] = df[f"ST_{ATR}_{Mult}"].shift(1).fillna(1)
df.loc[df[f"ST_{ATR}_{Mult}"] == 1, "if trend"] = 0
df.loc[df[f"ST_{ATR}_{Mult}"] == -1, "if trend"] = weekly
df.loc[df.index == 0, "trend bal"] = invest

# === Potential correction to the np.where ==============================
for index, row in df.iloc[1:].iterrows():
    row["trend bal"] = np.where(
        row[f"ST_{ATR}_{Mult}"] == 1,
        (df.loc[index - 1, "trend bal"] * (1 + row["rate"])) + weekly,
        df.loc[index - 1, "trend bal"] + row["if trend"],
    )
    # NOTE: The original "otherwise" clause from `np.where` had the
    # following value: `df.loc[index - index, "trend bal"] + ...`
    # I assumed you meant `index - 1`, instead of `index - index`,
    # therefore the above code uses `index - 1`. If you really meant
    # `index - index`, please change the code accordingly.
    df.loc[df.index == index, "trend bal"] = row["trend bal"]
df
Result:
timestamp  Open  High  Low  Close  Volume  rate  account bal  invested  ST_10_1.0  if trend  trend bal
2021-08-16  4382.44  4444.35  4367.73  4441.67  5988610000  1  10000  10000  1  0  10000
2021-08-23  4450.29  4513.33  4450.29  4509.37  14124930000  0.0152421  10252.4  10100  1  0  10252.4
2021-08-30  4513.76  4545.85  4513.76  4535.43  14256180000  0.00577909  10411.7  10200  1  0  10411.7
2021-09-06  4535.38  4535.38  4457.66  4458.58  11793790000  -0.0169444  10335.3  10300  1  0  10335.3
2021-09-13  4474.81  4492.99  4427.76  4432.99  17763120000  -0.00573946  10375.9  10400  1  0  10375.9
2021-09-20  4402.95  4465.4  4305.91  4455.48  15697030000  0.00507327  10528.6  10500  1  0  10528.6
2021-09-27  4442.12  4457.3  4288.52  4357.04  15555390000  -0.0220941  10396  10600  1  0  10396
2021-10-04  4348.84  4429.97  4278.94  4391.34  14795520000  0.00787227  10577.8  10700  1  0  10577.8
2021-10-11  4385.44  4475.82  4329.92  4471.37  13758090000  0.0182246  10870.6  10800  1  0  10870.6
2021-10-18  4463.72  4559.67  4447.47  4544.9  13966070000  0.0164446  11149.3  10900  1  0  11149.3
2021-10-25  4553.69  4608.08  4537.36  4605.38  16206040000  0.0133072  11397.7  11000  1  0  11397.7
2021-11-01  4610.62  4718.5  4595.06  4697.53  16397220000  0.0200092  11725.8  11100  1  0  11725.8
2021-11-08  4701.48  4714.92  4630.86  4682.85  15646510000  -0.00312498  11789.1  11200  1  0  11789.1
2021-11-15  4689.3  4717.75  4672.78  4697.96  15279660000  0.00322664  11927.2  11300  1  0  11927.2
2021-11-22  4712  4743.83  4585.43  4594.62  11775840000  -0.0219967  11764.8  11400  1  0  11764.8
2021-11-29  4628.75  4672.95  4495.12  4538.43  20242840000  -0.0122295  11720.9  11500  -1  100  11864.8
2021-12-06  4548.37  4713.57  4540.51  4712.02  15411530000  0.0382489  12269.2  11600  -1  100  11964.8
2021-12-13  4710.3  4731.99  4600.22  4620.64  19184960000  -0.0193929  12131.3  11700  1  0  11832.8
2021-12-20  4587.9  4740.74  4531.1  4725.79  10594350000  0.0227566  12507.4  11800  1  0  12202
2021-12-27  4733.99  4808.93  4733.99  4766.18  11687720000  0.00854675  12714.3  11900  1  0  12406.3
2022-01-03  4778.14  4818.62  4662.74  4677.03  16800900000  -0.0187048  12576.4  12000  1  0  12274.3
2022-01-10  4655.34  4748.83  4582.24  4662.85  17126800000  -0.00303177  12638.3  12100  1  0  12337.1
2022-01-17  4632.24  4632.24  4395.34  4397.94  14131200000  -0.0568129  12020.3  12200  1  0  11736.1
2022-01-24  4356.32  4453.23  4222.62  4431.85  21218590000  0.00771046  12213  12300  -1  100  11836.1
2022-01-31  4431.79  4595.31  4414.02  4500.53  18846100000  0.0154968  12502.2  12400  -1  100  11936.1
2022-02-07  4505.75  4590.03  4401.41  4418.64  19119200000  -0.0181956  12374.7  12500  1  0  11819
2022-02-14  4412.61  4489.55  4327.22  4348.87  17775970000  -0.0157899  12279.4  12600  1  0  11732.3
2022-02-21  4332.74  4385.34  4114.65  4384.65  16834460000  0.00822737  12480.4  12700  1  0  11928.9
2022-02-28  4354.17  4416.78  4279.54  4328.87  22302830000  -0.0127216  12421.6  12800  1  0  11877.1
2022-03-07  4327.01  4327.01  4157.87  4204.31  23849630000  -0.0287743  12164.2  12900  -1  100  11977.1
2022-03-14  4202.75  4465.4  4161.72  4463.12  24946690000  0.0615583  13013  13000  -1  100  12077.1
2022-03-21  4462.4  4546.03  4424.3  4543.06  19089240000  0.0179112  13346.1  13100  1  0  12393.4
2022-03-28  4541.09  4637.3  4507.57  4545.86  19212230000  0.000616282  13454.3  13200  1  0  12501.1
2022-04-04  4547.97  4593.45  4450.04  4488.28  19383860000  -0.0126665  13383.9  13300  1  0  12442.7
2022-04-11  4462.64  4471  4381.34  4392.59  13812410000  -0.02132  13198.5  13400  1  0  12277.4
2022-04-18  4385.63  4512.94  4267.62  4271.78  18149540000  -0.0275032  12935.5  13500  -1  100  12377.4
2022-04-25  4255.34  4308.45  4124.28  4131.93  19610750000  -0.032738  12612  13600  -1  100  12477.4
2022-05-02  4130.61  4307.66  4062.51  4123.34  21039720000  -0.00207901  12685.8  13700  -1  100  12577.4
2022-05-09  4081.27  4081.27  3858.87  4023.89  23166570000  -0.0241188  12479.9  13800  -1  100  12677.4
2022-05-16  4013.02  4090.72  3810.32  3901.36  20590520000  -0.0304506  12199.8  13900  -1  100  12777.4
2022-05-23  3919.42  4158.49  3875.13  4158.24  19139100000  0.0658437  13103.1  14000  -1  100  12877.4
2022-05-30  4151.09  4177.51  4073.85  4108.54  16049940000  -0.0119522  13046.5  14100  1  0  12823.5
2022-06-06  4134.72  4168.78  3900.16  3900.86  17547150000  -0.0505484  12487  14200  1  0  12275.3
2022-06-13  3838.15  3838.15  3636.87  3674.84  24639140000  -0.0579411  11863.5  14300  -1  100  12375.3
2022-06-20  3715.31  3913.65  3715.31  3911.74  19287840000  0.0644654  12728.3  14400  -1  100  12475.3
2022-06-27  3920.76  3945.86  3738.67  3825.33  17735450000  -0.0220899  12547.1  14500  -1  100  12575.3
2022-07-04  3792.61  3918.5  3742.06  3899.38  14223350000  0.0193578  12890  14600  -1  100  12675.3
2022-07-11  3880.94  3880.94  3721.56  3863.16  16313500000  -0.00928865  12870.3  14700  -1  100  12775.3
2022-07-18  3883.79  4012.44  3818.63  3961.63  16859220000  0.0254895  13298.4  14800  -1  100  12875.3
2022-07-25  3965.72  4140.15  3910.74  4130.29  17356830000  0.0425734  13964.5  14900  1  0  13523.5
2022-08-01  4112.38  4167.66  4079.81  4145.19  18072230000  0.00360747  14114.9  15000  1  0  13672.3
2022-08-08  4155.93  4280.47  4112.09  4280.15  18117740000  0.0325582  14674.4  15100  1  0  14217.4
2022-08-15  4269.37  4325.28  4218.7  4228.48  16255850000  -0.012072  14597.3  15200  1  0  14145.8
2022-08-19  4266.31  4266.31  4218.7  4228.48  2045645000  0  14697.3  15300  1  0  14245.8

Related

Finding mean/SD of a group of population and mean/SD of remaining population within a data frame

I have a pandas data frame that looks like this:
id age weight group
1 12 45 [10-20]
1 18 110 [10-20]
1 25 25 [20-30]
1 29 85 [20-30]
1 32 49 [30-40]
1 31 70 [30-40]
1 37 39 [30-40]
I am looking for a data frame that would look like this: (sd=standard deviation)
group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight
[10-20]
[20-30]
[30-40]
Here the second and third columns are the mean and SD of weight for that group; the fourth and fifth columns are the mean and SD for the rest of the groups combined.
Here's a way to do it:
res = df.group.to_frame().groupby('group').count()
for group in res.index:
    mask = df.group==group
    srGroup, srOther = df.loc[mask, 'weight'], df.loc[~mask, 'weight']
    res.loc[group, ['group_mean_weight','group_sd_weight','rest_mean_weight','rest_sd_weight']] = [
        srGroup.mean(), srGroup.std(), srOther.mean(), srOther.std()]
res = res.reset_index()
Output:
group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight
0 [10-20] 77.500000 45.961941 53.60 24.016661
1 [20-30] 55.000000 42.426407 62.60 28.953411
2 [30-40] 52.666667 15.821926 66.25 38.378596
An alternative way to get the same result is:
res = ( pd.DataFrame(
    df.group.drop_duplicates().to_frame()
    .apply(lambda x: [
        df.loc[df.group==x.group,'weight'].mean(),
        df.loc[df.group==x.group,'weight'].std(),
        df.loc[df.group!=x.group,'weight'].mean(),
        df.loc[df.group!=x.group,'weight'].std()], axis=1, result_type='expand')
    .to_numpy(),
    index=list(df.group.drop_duplicates()),
    columns=['group_mean_weight','group_sd_weight','rest_mean_weight','rest_sd_weight'])
    .reset_index().rename(columns={'index':'group'}) )
Output:
group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight
0 [10-20] 77.500000 45.961941 53.60 24.016661
1 [20-30] 55.000000 42.426407 62.60 28.953411
2 [30-40] 52.666667 15.821926 66.25 38.378596
UPDATE:
OP asked in a comment: "what if I have more than one weight column? what if I have around 10 different weight columns and I want sd for all weight columns?"
To illustrate below, I have created two weight columns (weight and weight2) and have simply provided all 4 aggregates (mean, sd, mean of other, sd of other) for each weight column.
wgtCols = ['weight','weight2']
res = ( pd.concat([ pd.DataFrame(
    df.group.drop_duplicates().to_frame()
    .apply(lambda x: [
        df.loc[df.group==x.group,wgtCol].mean(),
        df.loc[df.group==x.group,wgtCol].std(),
        df.loc[df.group!=x.group,wgtCol].mean(),
        df.loc[df.group!=x.group,wgtCol].std()], axis=1, result_type='expand')
    .to_numpy(),
    index=list(df.group.drop_duplicates()),
    columns=[f'group_mean_{wgtCol}',f'group_sd_{wgtCol}',f'rest_mean_{wgtCol}',f'rest_sd_{wgtCol}'])
    for wgtCol in wgtCols], axis=1)
    .reset_index().rename(columns={'index':'group'}) )
Input:
id age weight weight2 group
0 1 12 45 55 [10-20]
1 1 18 110 120 [10-20]
2 1 25 25 35 [20-30]
3 1 29 85 95 [20-30]
4 1 32 49 59 [30-40]
5 1 31 70 80 [30-40]
6 1 37 39 49 [30-40]
Output:
group group_mean_weight group_sd_weight rest_mean_weight rest_sd_weight group_mean_weight2 group_sd_weight2 rest_mean_weight2 rest_sd_weight2
0 [10-20] 77.500000 45.961941 53.60 24.016661 87.500000 45.961941 63.60 24.016661
1 [20-30] 55.000000 42.426407 62.60 28.953411 65.000000 42.426407 72.60 28.953411
2 [30-40] 52.666667 15.821926 66.25 38.378596 62.666667 15.821926 76.25 38.378596
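As an aside, the "rest" mean can also be obtained without a loop, from group totals: rest_mean = (total_sum - group_sum) / (total_n - group_n). A small sketch of that shortcut for the mean on the same input (the SD would still need the loop above or a comparable sums-of-squares identity):

```python
import pandas as pd

df = pd.DataFrame({'weight': [45, 110, 25, 85, 49, 70, 39],
                   'group': ['[10-20]', '[10-20]', '[20-30]', '[20-30]',
                             '[30-40]', '[30-40]', '[30-40]']})

# Per-group sums and counts in one pass
g = df.groupby('group')['weight'].agg(['sum', 'count', 'mean'])

# "Rest" mean = (overall sum minus the group's sum) over the remaining rows
total_sum, total_n = df['weight'].sum(), len(df)
g['rest_mean'] = (total_sum - g['sum']) / (total_n - g['count'])

print(g[['mean', 'rest_mean']])
```

On this data it reproduces the rest_mean_weight column shown above (53.60, 62.60, 66.25).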

Merging computed file contents and display previous computed data in output

I am working with two files, oldFile.txt and newFile.txt, and computing some changes between them. newFile.txt is updated constantly, and any updates are written to oldFile.txt.
I am trying to improve the snippet below by saving the previously computed values and adding them to finalOutput.txt. Any ideas would be very helpful to accomplish the needed output. Thank you in advance.
import pandas as pd
from time import sleep

def read_file(fn):
    data = {}
    with open(fn, 'r') as f:
        for lines in f:
            line = lines.rstrip()
            pname, cnt, cat = line.split(maxsplit=2)
            data.update({pname: {'pname': pname, 'cnt': int(cnt), 'cat': cat}})
    return data

def process_data(oldfn, newfn):
    old = read_file(oldfn)
    new = read_file(newfn)
    u_data = {}
    for ko, vo in old.items():
        if ko in new:
            n = new[ko]
            old_cnt = vo['cnt']
            new_cnt = n['cnt']
            u_cnt = old_cnt + new_cnt
            tmp_old_cnt = 1 if old_cnt == 0 else old_cnt
            cnt_change = 100 * (new_cnt - tmp_old_cnt) / tmp_old_cnt
            u_data.update({ko: {'pname': n['pname'], 'cnt': new_cnt, 'cat': n['cat'],
                                'curr_change%': round(cnt_change, 0)}})
    for kn, vn in new.items():
        if kn not in old:
            old_cnt = 1
            new_cnt = vn['cnt']
            cnt_change = 0
            vn.update({'cnt_change': round(cnt_change, 0)})
            u_data.update({kn: vn})
    pd.options.display.float_format = "{:,.0f}".format
    mydata = []
    for _, v in u_data.items():
        mydata.append(v)
    df = pd.DataFrame(mydata)
    df = df.sort_values(by=['cnt'], ascending=False)
    # Save to text file.
    with open('finalOutput.txt', 'w') as w:
        w.write(df.to_string(header=None, index=False))
    # Overwrite oldFile.txt
    with open('oldFile.txt', 'w') as w:
        w.write(df.to_string(header=None, index=False))
    # Print in console.
    df.insert(0, '#', range(1, 1 + len(df)))
    print(df.to_string(index=False, header=True))

while True:
    oldfn = './oldFile.txt'
    newfn = './newFile.txt'
    process_data(oldfn, newfn)
    sleep(60)
oldFile.txt
e6c76e4810a464bc 1 Hello(HLL)
65b66cc4e81ac81d 2 CryptoCars (CCAR)
c42d0c924df124ce 3 GoldNugget (NGT)
ee70ad06df3d2657 4 BabySwap (BABY)
e5b7ebc589ea9ed8 8 Heroes&E... (HE)
7e7e9d75f5da2377 3 Robox (RBOX)
newfile.txt #-- content during 1st reading
e6c76e4810a464bc 34 Hello(HLL)
65b66cc4e81ac81d 43 CryptoCars (CCAR)
c42d0c924df124ce 95 GoldNugget (NGT)
ee70ad06df3d2657 15 BabySwap (BABY)
e5b7ebc589ea9ed8 37 Heroes&E... (HE)
7e7e9d75f5da2377 23 Robox (RBOX)
755507d18913a944 49 CharliesFactory
newfile.txt #-- content during 2nd reading
924dfc924df1242d 35 AeroDie (ADie)
e6c76e4810a464bc 34 Hello(HLL)
65b66cc4e81ac81d 73 CryptoCars (CCAR)
c42d0c924df124ce 15 GoldNugget (NGT)
ee70ad06df3d2657 5 BabySwap (BABY)
e5b7ebc589ea9ed8 12 Heroes&E... (HE)
7e7e9d75f5da2377 19 Robox (RBOX)
755507d18913a944 169 CharliesFactory
newfile.txt # content during 3rd reading
924dfc924df1242d 45 AeroDie (ADie)
e6c76e4810a464bc 2 Hello(HLL)
65b66cc4e81ac81d 4 CryptoCars (CCAR)
c42d0c924df124ce 7 GoldNugget (NGT)
ee70ad06df3d2657 5 BabySwap (BABY)
e5b7ebc589ea9ed8 3 Heroes&E... (HE)
7e7e9d75f5da2377 6 Robox (RBOX)
755507d18913a944 9 CharliesFactory
oldFile.txt #-- Current output that needs improvement
# pname cnt cat curr_change%
1 924dfc924df1242d 35 AeroDie (ADie) 29
2 755507d18913a944 9 CharliesFactory -95
3 c42d0c924df124ce 7 GoldNugget (NGT) -53
4 7e7e9d75f5da2377 6 Robox (RBOX) -68
5 ee70ad06df3d2657 5 BabySwap (BABY) 0
6 65b66cc4e81ac81d 4 CryptoCars (CCAR) -95
7 e5b7ebc589ea9ed8 3 Heroes&E... (HE) -75
8 e6c76e4810a464bc 2 Hello(HLL) -94
finalOutput.txt #-- Needed Improved Output with additional columns r1, r2 and so on depending on how many update readings
# curr_change% is the latest 3rd reading
# r2% is based on the 2nd reading
# r1% is based on the 1st reading
# pname cnt cat curr_change% r2% r1%
1 924dfc924df1242d 35 AeroDie (ADie) 29 0 0
2 755507d18913a944 9 CharliesFactory -95 245 0
3 c42d0c924df124ce 7 GoldNugget (NGT) -53 -84 3,067
4 7e7e9d75f5da2377 6 Robox (RBOX) -68 -17 667
5 ee70ad06df3d2657 5 BabySwap (BABY) 0 -67 275
6 65b66cc4e81ac81d 4 CryptoCars (CCAR) -95 70 2,050
7 e5b7ebc589ea9ed8 3 Heroes&E... (HE) -75 -68 362
8 e6c76e4810a464bc 2 Hello(HLL) -94 0 3,300
Updated for feedback: I made adjustments so that it handles data fed to it live. Whenever new data is loaded, pass the file name to the process_new_file() function, and it will update 'finalOutput.txt'.
For simplicity, I named the different files file1, file2, file3, and file4.
I'm doing most of the operations using the pandas Dataframe. I think working with Pandas DataFrames will make the task a lot easier for you.
Overall, I created one function to read the file and return a properly formatted DataFrame. I created a second function that compares the old and the new file and does the calculation you were looking for. I merge together the results of these calculations. Finally, I merge all of these calculations with the last file's data to get the output you're looking for.
import pandas as pd

global_old_df = None
results_df = pd.DataFrame()
count = 0

def read_file(file_name):
    rows = []
    with open(file_name) as f:
        for line in f:
            rows.append(line.split(" ", 2))
    df = pd.DataFrame(rows, columns=['pname', 'cnt', 'cat'])
    df['cat'] = df['cat'].str.strip()
    df['cnt'] = df['cnt'].astype(float)
    return df

def compare_dfs(df_old, df_new, count):
    df_ = df_old.merge(df_new, on=['pname', 'cat'], how='outer')
    df_['r%s' % count] = (df_['cnt_y'] / df_['cnt_x'] - 1) * 100
    df_ = df_[['pname', 'r%s' % count]]
    df_ = df_.set_index('pname')
    return df_

def process_new_file(file):
    global global_old_df
    global results_df
    global count
    df_new = read_file(file)
    if global_old_df is None:
        global_old_df = df_new
        return
    else:
        count += 1
        r_df = compare_dfs(global_old_df, df_new, count)
        results_df = pd.concat([r_df, results_df], axis=1)
        global_old_df = df_new
    output_df = df_new.merge(results_df, left_on='pname', right_index=True)
    output_df.to_csv('finalOutput.txt')
    pd.options.display.float_format = "{:,.1f}".format
    print(output_df.to_string())

files = ['file1.txt', 'file2.txt', 'file3.txt', 'file4.txt']
for file in files:
    process_new_file(file)
This gives the output:
pname cnt cat r3 r2 r1
0 924dfc924df1242d 45.0 AeroDie (ADie) 28.6 NaN NaN
1 e6c76e4810a464bc 2.0 Hello(HLL) -94.1 0.0 3,300.0
2 65b66cc4e81ac81d 4.0 CryptoCars (CCAR) -94.5 69.8 2,050.0
3 c42d0c924df124ce 7.0 GoldNugget (NGT) -53.3 -84.2 3,066.7
4 ee70ad06df3d2657 5.0 BabySwap (BABY) 0.0 -66.7 275.0
5 e5b7ebc589ea9ed8 3.0 Heroes&E... (HE) -75.0 -67.6 362.5
6 7e7e9d75f5da2377 6.0 Robox (RBOX) -68.4 -17.4 666.7
7 755507d18913a944 9.0 CharliesFactory -94.7 244.9 NaN
So, to run it live, you'd just replace that last section with:
from time import sleep

while True:
    newfn = './newFile.txt'
    process_new_file(newfn)
    sleep(60)
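As a quick sanity check on the percent-change logic in compare_dfs, here is a minimal sketch with made-up data (the pname/cat values are hypothetical):

```python
import pandas as pd

# Hypothetical old and new snapshots of the same two products
old = pd.DataFrame({'pname': ['a1', 'b2'], 'cat': ['X', 'Y'], 'cnt': [10.0, 4.0]})
new = pd.DataFrame({'pname': ['a1', 'b2'], 'cat': ['X', 'Y'], 'cnt': [15.0, 2.0]})

# Outer merge keeps products that appear in only one file (their change is NaN);
# the shared 'cnt' column gets the _x (old) and _y (new) suffixes
m = old.merge(new, on=['pname', 'cat'], how='outer')
pct = (m['cnt_y'] / m['cnt_x'] - 1) * 100
print(pct.tolist())  # [50.0, -50.0]
```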

Creating a dataframe based on 2 dataframe sets that have different lengths

I have 2 dataframes and I want to create a third one. I am trying to write code that does the following:
If A_pd["from"] and A_pd["To"] are both within the range of B_pd["from"] and B_pd["To"], then add A_pd["from"], A_pd["To"], and B_pd["Value"] to the C_pd dataframe.
If A_pd["from"] is within the range of B_pd["from"] and B_pd["To"], but A_pd["To"] is within the range of B_pd["from"] and B_pd["To"] of the next row, then I want to split the range A_pd["from"]..A_pd["To"] into 2 ranges, (A_pd["from"], B_pd["To"]) and (B_pd["To"], A_pd["To"]), each with the corresponding B_pd["Value"].
I created the following code:
import pandas as pd

A_pd = {'from': [0, 20, 80, 180, 250],
        'To':   [20, 50, 120, 210, 300]}
A_pd = pd.DataFrame(A_pd)
B_pd = {'from': [0, 20, 100, 200],
        'To':   [20, 100, 200, 300],
        'Value': [20, 17, 15, 12]}
B_pd = pd.DataFrame(B_pd)

for i in range(len(A_pd)):
    numberOfIntrupt = 0
    for j in range(len(B_pd)):
        if A_pd["from"].values[i] >= B_pd["from"].values[j] and A_pd["from"].values[i] > B_pd["To"].values[j]:
            numberOfIntrupt += 1

cols = ['C_from', 'C_To', 'C_value']
C_dp = pd.DataFrame(columns=cols, index=range(len(A_pd) + numberOfIntrupt))
for i in range(len(A_pd)):
    for j in range(len(B_pd)):
        a = A_pd["from"].values[i]
        b = A_pd["To"].values[i]
        c_eval = B_pd["Value"].values[j]
        range_s = B_pd["from"].values[j]
        range_f = B_pd["To"].values[j]
        if a >= range_s and a <= range_f and b >= range_s and b <= range_f:
            C_dp['C_from'].loc[i] = a
            C_dp['C_To'].loc[i] = b
            C_dp['C_value'].loc[i] = c_eval
        elif a >= range_s and b > range_f:
            C_dp['C_from'].loc[i] = a
            C_dp['C_To'].loc[i] = range_f
            C_dp['C_value'].loc[i] = c_eval
            C_dp['C_from'].loc[i + 1] = range_f
            C_dp['C_To'].loc[i + 1] = b
            C_dp['C_value'].loc[i + 1] = B_pd["Value"].values[j + 1]
print(C_dp)
The current result is C_dp:
C_from C_To C_value
0 0 20 20
1 20 50 17
2 80 100 17
3 180 200 15
4 250 300 12
5 200 300 12
6 NaN NaN NaN
7 NaN NaN NaN
the expected should be :
C_from C_To C_value
0 0 20 20
1 20 50 17
2 80 100 17
3 100 120 15
4 180 200 15
5 200 210 12
6 250 300 12
Thank you a lot for the support
I'm sure there is a better way to do this without loops, but this will help your logic flow.
import pandas as pd

A_pd = pd.DataFrame({'from': [0, 20, 80, 180, 250],
                     'To':   [20, 50, 120, 210, 300]})
B_pd = pd.DataFrame({'from': [0, 20, 100, 200],
                     'To':   [20, 100, 200, 300],
                     'Value': [20, 17, 15, 12]})

cols = ['C_from', 'C_To', 'C_value']
rows = []  # collect result rows in a list (DataFrame.append was removed in pandas 2.0)
spillover = False
for i in range(len(A_pd)):
    for j in range(len(B_pd)):
        a_from = A_pd["from"].values[i]
        a_to = A_pd["To"].values[i]
        b_from = B_pd["from"].values[j]
        b_to = B_pd["To"].values[j]
        b_value = B_pd['Value'].values[j]
        if a_from >= b_to:
            # a_from outside b range
            continue  # next b
        elif a_from >= b_from:
            # a_from within b range
            if a_to <= b_to:
                rows.append({"C_from": a_from, "C_To": a_to, "C_value": b_value})
                break  # next a
            else:
                rows.append({"C_from": a_from, "C_To": b_to, "C_value": b_value})
                spillover = True
                continue
        if spillover:
            if a_to <= b_to:
                rows.append({"C_from": b_from, "C_To": a_to, "C_value": b_value})
                spillover = False
                break
            else:
                rows.append({"C_from": b_from, "C_To": b_to, "C_value": b_value})
                spillover = True
                continue

C_dp = pd.DataFrame(rows, columns=cols)
print(C_dp)
Output
C_from C_To C_value
0 0 20 20
1 20 50 17
2 80 100 17
3 100 120 15
4 180 200 15
5 200 210 12
6 250 300 12
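As the answer notes, there should be a loop-free way. One possible vectorized sketch (assuming pandas >= 1.2 for how='cross') pairs every A range with every B range, keeps the overlapping pairs, and clips each A range to the B range it overlaps:

```python
import pandas as pd

A_pd = pd.DataFrame({'from': [0, 20, 80, 180, 250],
                     'To':   [20, 50, 120, 210, 300]})
B_pd = pd.DataFrame({'from': [0, 20, 100, 200],
                     'To':   [20, 100, 200, 300],
                     'Value': [20, 17, 15, 12]})

# Cartesian product of A and B rows; shared column names get _a/_b suffixes
pairs = A_pd.merge(B_pd, how='cross', suffixes=('_a', '_b'))

# Keep only pairs whose half-open ranges actually overlap
overlap = pairs[(pairs['from_a'] < pairs['To_b']) & (pairs['To_a'] > pairs['from_b'])]

# Clip each A range to the overlapping B range and carry over B's value
C_dp = pd.DataFrame({
    'C_from':  overlap[['from_a', 'from_b']].max(axis=1),
    'C_To':    overlap[['To_a', 'To_b']].min(axis=1),
    'C_value': overlap['Value'],
}).reset_index(drop=True)
print(C_dp)
```

On the sample data this reproduces the expected 7-row output above.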

Pandas: use apply to create 2 new columns

I have a dataset where column a gives the number of values in each of e, i, d, and t, which are strings of numbers separated by "-":
a e i d t
0 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1
1 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4
3 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1
5 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4
I want to create 8 new columns: 4 with the SUM of (e, i, d, t) and 4 with the product.
For example:
def funct_two_outputs(E, I, d, t, d_calib=50):
    return E + I + d + t, E * I * d * t
OUT, first 2 values:
sum-0 (row 0) = 40+0.5+30+1; sum-1 = 80+0.3+32+1
The sum and product are example functions standing in for my real functions, which are a bit more complicated.
I have written a function expand_on_col that separates all the e, i, d, t values into new columns:
import numpy as np
import pandas as pd

def expand_on_col(df_, col_to_split="namecol", sep='-', prefix="this"):
    '''
    Pass a df indicating on which col you want to split;
    return a df with the col split, with a prefix.
    '''
    df1 = df_[col_to_split].str.split(sep, expand=True).add_prefix(prefix)
    df1 = pd.concat([df_, df1], axis=1).replace(np.nan, '-')
    return df1
Now I need to create 4 new columns that are the sum of e, i, d, t, and 4 that are the product.
Example output for SUM:
index a e i d t a-0 e-0 e-1 e-2 e-3 i-0 i-1 i-2 i-3 d-0 d-1 d-2 d-3 t-0 t-1 t-2 t-3 sum-0 sum-1 sum-2 sum-3
0 0 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 4 40 80 120 150 0.5 0.3 0.2 0.2 30 32 30 32 1 1 1 1 71 114 153 186
1 1 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 4 40 40 40 40 0.1 0.1 0.1 0.1 18 18 18 18 1 2 3 4 59 61 63 65
2 3 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 4 40 80 120 150 0.5 0.3 0.2 0.2 30 32 30 32 1 1 1 1 71 114 153 186
3 5 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 4 40 40 40 40 0.1 0.1 0.1 0.1 18 18 18 18 1 2 3 4 59 61 63 65
If I run the code with funct_one_outputs (which only returns the sum) it works, but with funct_two_outputs (sum and product) I get an error.
Here is the code:
import numpy as np
import pandas as pd

def expand_on_col(df_, col_to_split="namecol", sep='-', prefix="this"):
    '''
    Pass a df indicating on which col you want to split;
    return a df with the col split, with a prefix.
    '''
    df1 = df_[col_to_split].str.split(sep, expand=True).add_prefix(prefix)
    df1 = pd.concat([df_, df1], axis=1).replace(np.nan, '-')
    return df1

def funct_two_outputs(E, I, d, t, d_calib=50):  # the function I want to pass
    return E + I + d + t, E * I * d * t

def funct_one_outputs(E, I, d, t, d_calib=50):  # for now I can only use this one; can't use 2 return values
    return E + I + d + t

columns = ['e', 'i', 'd', 't']  # the columns holding the '-'-separated values
for col in columns:
    df = expand_on_col(df_=df, col_to_split=col, sep='-', prefix=f"{col}-")
cols_ = df.columns.drop(columns)
df[cols_] = df[cols_].apply(pd.to_numeric, errors="coerce")
df["a"] = df["a"].apply(pd.to_numeric, errors="coerce")
df.reset_index(inplace=True)

for i in range(max(df["a"])):
    name_1, name_2 = f"sum-{i}", f"mult-{i}"
    df[name_1] = df.apply(lambda row: funct_one_outputs(E=row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
    # if I try to fill 2 outputs it won't work
    df[[name_1, name_2]] = df.apply(lambda row: funct_two_outputs(E=row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
OUT:
ValueError Traceback (most recent call last)
<ipython-input-306-85157b89d696> in <module>()
68 df[name_1] = df.apply(lambda row: funct_one_outputs(E= row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
69 #if i try and fill 2 outputs it wont work
---> 70 df[[name_1, name_2]] = df.apply(lambda row: funct_two_outputs(E= row[f'e-{i}'], I=row[f'i-{i}'], d=row[f'd-{i}'], t=row[f"t-{i}"]), axis=1)
71
72
2 frames
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in __setitem__(self, key, value)
3039 self._setitem_frame(key, value)
3040 elif isinstance(key, (Series, np.ndarray, list, Index)):
-> 3041 self._setitem_array(key, value)
3042 else:
3043 # set column
/usr/local/lib/python3.7/dist-packages/pandas/core/frame.py in _setitem_array(self, key, value)
3074 )[1]
3075 self._check_setitem_copy()
-> 3076 self.iloc._setitem_with_indexer((slice(None), indexer), value)
3077
3078 def _setitem_frame(self, key, value):
/usr/local/lib/python3.7/dist-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
1751 if len(ilocs) != len(value):
1752 raise ValueError(
-> 1753 "Must have equal len keys and value "
1754 "when setting with an iterable"
1755 )
ValueError: Must have equal len keys and value when setting with an iterable
Don't Use apply
If you can help it
s = pd.to_numeric(
    df[['e', 'i', 'd', 't']]
    .stack()
    .str.split('-', expand=True)
    .stack()
)

# Series.sum(level=...) is deprecated; group on the index levels instead
sums = s.groupby(level=[0, 2]).sum().rename('Sum')
prods = s.groupby(level=[0, 2]).prod().rename('Prod')

sums_prods = pd.concat([sums, prods], axis=1).unstack()
sums_prods.columns = [f'{o}-{i}' for o, i in sums_prods.columns]
df.join(sums_prods)
a e i d t Sum-0 Sum-1 Sum-2 Sum-3 Prod-0 Prod-1 Prod-2 Prod-3
0 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 71.5 113.3 151.2 183.2 600.0 768.0 720.0 960.0
1 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 59.1 60.1 61.1 62.1 72.0 144.0 216.0 288.0
3 4 40-80-120-150 0.5-0.3-0.2-0.2 30-32-30-32 1-1-1-1 71.5 113.3 151.2 183.2 600.0 768.0 720.0 960.0
5 4 40-40-40-40 0.1-0.1-0.1-0.1 18-18-18-18 1-2-3-4 59.1 60.1 61.1 62.1 72.0 144.0 216.0 288.0
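As for the original ValueError: apply can assign a two-value return to two columns if each returned tuple is expanded into columns. A minimal sketch of the result_type='expand' option, using toy column names rather than the question's data:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [3, 4]})

def two_outputs(a, b):
    return a + b, a * b  # (sum, product), like funct_two_outputs

# result_type='expand' turns each returned tuple into a row of a DataFrame,
# so the two result columns can be assigned to two new columns at once
df[['s', 'p']] = df.apply(lambda r: two_outputs(r['x'], r['y']),
                          axis=1, result_type='expand')
print(df[['s', 'p']])
```

Without result_type='expand', apply returns a single Series of tuples, which is why the two-column assignment fails with "Must have equal len keys and value".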

What is the quickest way in python to compute inner(dot) product of normalized document vectors?

Document vectors in dictionary form look like this:
{'abcd':0.4531,
'hhks':0.08763,
'djlkl':9843
}
The length of the vectors can vary.
I have tried a pandas Series; however, for smaller vectors pandas was about 100 times slower than a dictionary implementation.
Is there a better way of doing this?
Code using dictionaries (the length of d1 is always less than the length of d2):
def cosine_smaller_larger(d1, d2):
    s = 0.0
    for key in d1.keys():
        if key in d2:
            s += d1[key] * d2[key]
    return s
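A slightly more compact dict-based variant (same asymptotic cost) intersects the key views directly; a sketch, with the helper name dot_product chosen here for illustration:

```python
def dot_product(d1, d2):
    # dict key views support set intersection, so we iterate only shared keys;
    # always iterate over the smaller dict for fewer lookups
    small, large = (d1, d2) if len(d1) <= len(d2) else (d2, d1)
    return sum(small[k] * large[k] for k in small.keys() & large.keys())

# only 'hhks' is shared: 2.0 * 3.0
print(dot_product({'abcd': 0.5, 'hhks': 2.0}, {'hhks': 3.0, 'djlkl': 1.0}))  # 6.0
```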
Code using pandas:
from pandas import Series

def seriesmult(s1, s2):
    return s1.mul(s2, fill_value=0).sum()

def cosine_smaller_larger(d1, d2):
    s1 = Series(d1)
    s2 = Series(d2)
    return seriesmult(s1, s2)
Code used to profile the above two methods (I am ignoring the time taken to create the pandas Series):
import time
from pandas import Series

def cosine_smaller_larger_comparison(d1, d2):
    t1 = time.time()
    s = 0.0
    for key in d1.keys():
        if key in d2:
            s += d1[key] * d2[key]
    t2 = time.time()
    t3 = t2 - t1
    s1 = Series(d1)
    s2 = Series(d2)
    t1 = time.time()
    ss = s1.mul(s2, fill_value=0).sum()
    t2 = time.time()
    t4 = t2 - t1
    try:
        t5 = t4 / t3
    except ZeroDivisionError:
        t5 = 'division by zero'
    intersection = set(d1.keys()) & set(d2.keys())
    num_mult = len(intersection)
    ld1 = len(d1)
    ld2 = len(d2)
    output = "L1 = {}, L2 = {}, Mults = {}, PT<DT? = {}, PT = {}, DT = {}, PT/DT = {}".format(ld1, ld2, num_mult, t4 < t3, t4, t3, t5)
    print(output)
    return s
Case 1: Large vectors (L1 > 1000)
I converted the output of cosine_smaller_larger_comparison into a pandas dataframe to check the behavior for large vectors.
L1 = length of the first vector
L2 = length of the second vector
Mults = number of non-zero multiplications
PT = time taken by pandas
DT = time taken by the dictionary implementation
PTdivDT = the factor by which the dictionary beats pandas
PTltDT = was pandas faster than the dictionary for this particular vector
(Pdb) df1.loc[df1['L1']>1000][:10]
DT L1 L2 Mults PT PTdivDT PTltDT
64002 0.000145 1064 1361 151 0.001333 9.195724 False
64308 0.000168 1064 1853 178 0.001125 6.692199 False
64362 0.000197 1044 1064 148 0.001260 6.397094 False
108372 0.000180 1018 1064 167 0.001298 7.210596 False
113457 0.001332 3141 9644 3141 0.003576 2.685106 False
113458 0.002342 3886 9083 3886 0.004181 1.785198 False
113583 0.002099 3435 9644 3433 0.003591 1.710813 False
113584 0.002662 4101 9083 4095 0.003828 1.437937 False
113592 0.000887 1853 19674 1850 0.005778 6.514785 False
113619 0.002480 3198 9644 3193 0.003207 1.293337 False
Here the dictionary implementation beats the pandas Series, but by a smaller margin.
Case 2: Smaller vectors
Here are some input sizes for which pandas was more than 100 times slower:
(Pdb) df1.loc[df1['PTdivDT']>100][:30]
DT L1 L2 Mults PT PTdivDT PTltDT
0 0.000002 3 3 0 0.001242 651.250000 False
1 0.000002 3 3 0 0.000558 292.625000 False
6 0.000003 3 4 1 0.000341 110.000000 False
8 0.000001 0 0 0 0.000106 111.000000 False
10 0.000001 0 30 0 0.000362 379.750000 False
18 0.000001 1 3 0 0.000339 284.200000 False
19 0.000000 1 3 0 0.000341 inf False
24 0.000001 1 3 0 0.000381 399.500000 False
26 0.000000 0 0 0 0.000103 inf False
28 0.000003 29 30 0 0.000399 128.769231 False
31 0.000004 12 20 5 0.000409 100.941176 False
32 0.000003 8 156 4 0.000377 121.615385 False
33 0.000002 11 369 0 0.000410 214.875000 False
34 0.000002 1 1 1 0.000202 105.875000 False
35 0.000003 2 60 2 0.000349 112.615385 False
36 0.000001 1 3 0 0.000335 351.250000 False
37 0.000001 1 3 0 0.000325 272.600000 False
39 0.000003 17 32 2 0.000389 136.000000 False
41 0.000003 11 18 4 0.000386 124.538462 False
42 0.000001 3 5 0 0.000332 348.250000 False
44 0.000001 0 0 0 0.000102 107.000000 False
46 0.000004 30 42 0 0.000471 116.235294 False
51 0.000010 59 369 2 0.001014 101.261905 False
54 0.000001 1 3 0 0.000518 543.250000 False
55 0.000001 1 3 0 0.000526 551.750000 False
57 0.000004 11 32 2 0.000461 113.705882 False
60 0.000001 1 3 0 0.000660 692.250000 False
62 0.000001 0 2 0 0.000293 307.000000 False
64 0.000003 26 30 0 0.000343 110.692308 False
65 0.000002 1 1 1 0.000223 116.875000 False
