I have an extract of a dataframe below:
ticker date open high low close
0 A2M 2020-08-28 18.45 18.71 17.39 17.47
1 A2M 2020-09-04 17.47 17.52 16.53 16.70
2 A2M 2020-09-11 16.70 16.97 16.13 16.45
3 A2M 2020-09-18 16.54 16.77 16.25 16.39
4 A2M 2020-09-25 16.36 17.13 16.32 17.02
5 AAN 2007-06-08 15.29 15.33 14.93 15.07
6 AAN 2007-06-15 15.10 15.23 14.95 15.18
7 AAN 2007-06-22 15.18 15.25 15.12 15.16
8 AAN 2007-06-29 15.14 15.25 15.11 15.22
9 AAN 2007-07-06 15.11 15.33 15.07 15.33
10 AAN 2007-07-13 15.29 15.35 15.12 15.26
11 AAN 2007-07-20 15.25 15.27 15.02 15.10
12 AAN 2007-07-27 15.05 15.15 14.00 14.82
13 AAN 2007-08-03 14.72 14.85 14.47 14.69
14 AAN 2007-08-10 14.56 14.90 14.22 14.54
15 AAN 2007-08-17 14.55 14.79 13.71 14.42
16 AAP 2000-10-06 7.11 7.14 7.10 7.12
17 AAP 2000-10-13 7.13 7.17 7.12 7.17
18 AAP 2000-10-20 7.16 7.25 7.16 7.23
19 AAP 2000-10-27 7.23 7.24 7.22 7.23
20 AAP 2000-11-03 7.16 7.25 7.12 7.25
21 AAP 2000-11-10 7.24 7.24 7.12 7.12
22 ABB 2002-07-26 2.70 3.05 2.60 2.95
23 ABB 2002-08-02 2.92 2.95 2.75 2.80
24 ABB 2002-08-09 2.80 2.84 2.70 2.70
25 ABB 2002-08-16 2.72 2.75 2.70 2.75
26 ABB 2002-08-23 2.71 2.85 2.71 2.75
27 ABB 2002-08-30 2.75 2.75 2.75 2.75
I've created the following code to find upPrices vs. downPrices:
i = 0
upPrices = []
downPrices = []
while i < len(df['close']):
    if i == 0:
        upPrices.append(0)
        downPrices.append(0)
    else:
        if (df['close'][i] - df['close'][i-1]) > 0:
            upPrices.append(df['close'][i] - df['close'][i-1])
            downPrices.append(0)
        else:
            downPrices.append(df['close'][i] - df['close'][i-1])
            upPrices.append(0)
    i += 1
df['upPrices'] = upPrices
df['downPrices'] = downPrices
The result is the following dataframe:
ticker date open high low close upPrices downPrices
0 A2M 2020-08-28 18.45 18.71 17.39 17.47 0.00 0.00
1 A2M 2020-09-04 17.47 17.52 16.53 16.70 0.00 -0.77
2 A2M 2020-09-11 16.70 16.97 16.13 16.45 0.00 -0.25
3 A2M 2020-09-18 16.54 16.77 16.25 16.39 0.00 -0.06
4 A2M 2020-09-25 16.36 17.13 16.32 17.02 0.63 0.00
5 AAN 2007-06-08 15.29 15.33 14.93 15.07 0.00 -1.95
6 AAN 2007-06-15 15.10 15.23 14.95 15.18 0.11 0.00
7 AAN 2007-06-22 15.18 15.25 15.12 15.16 0.00 -0.02
8 AAN 2007-06-29 15.14 15.25 15.11 15.22 0.06 0.00
9 AAN 2007-07-06 15.11 15.33 15.07 15.33 0.11 0.00
10 AAN 2007-07-13 15.29 15.35 15.12 15.26 0.00 -0.07
11 AAN 2007-07-20 15.25 15.27 15.02 15.10 0.00 -0.16
12 AAN 2007-07-27 15.05 15.15 14.00 14.82 0.00 -0.28
13 AAN 2007-08-03 14.72 14.85 14.47 14.69 0.00 -0.13
14 AAN 2007-08-10 14.56 14.90 14.22 14.54 0.00 -0.15
15 AAN 2007-08-17 14.55 14.79 13.71 14.42 0.00 -0.12
16 AAP 2000-10-06 7.11 7.14 7.10 7.12 0.00 -7.30
17 AAP 2000-10-13 7.13 7.17 7.12 7.17 0.05 0.00
18 AAP 2000-10-20 7.16 7.25 7.16 7.23 0.06 0.00
19 AAP 2000-10-27 7.23 7.24 7.22 7.23 0.00 0.00
20 AAP 2000-11-03 7.16 7.25 7.12 7.25 0.02 0.00
21 AAP 2000-11-10 7.24 7.24 7.12 7.12 0.00 -0.13
22 ABB 2002-07-26 2.70 3.05 2.60 2.95 0.00 -4.17
23 ABB 2002-08-02 2.92 2.95 2.75 2.80 0.00 -0.15
24 ABB 2002-08-09 2.80 2.84 2.70 2.70 0.00 -0.10
25 ABB 2002-08-16 2.72 2.75 2.70 2.75 0.05 0.00
26 ABB 2002-08-23 2.71 2.85 2.71 2.75 0.00 0.00
27 ABB 2002-08-30 2.75 2.75 2.75 2.75 0.00 0.00
Unfortunately the logic is not correct: upPrices and downPrices need to be computed per ticker. At the moment, you can see that rows 5, 16 and 22 compare against the previous close of a different ticker. Essentially, I need this calculation to restart at each ticker, via groupby or some other means. However, when I try to add in groupby it returns index-length-mismatch errors.
Please help!
Your intuition about groupby is correct: group by ticker, then diff the closing prices. You can then use where to split the result into the up and down columns you wanted. Plus, no more loop! For something that only requires basic arithmetic, a vectorized approach is much better.
import pandas as pd
data = {"ticker":["A2M","A2M","A2M","A2M","A2M","AAN","AAN","AAN","AAN"], "close":[17.47,16.7,16.45,16.39,17.02,15.07,15.18,15.16,15.22]}
df = pd.DataFrame(data)
df["diff"] = df.groupby("ticker")["close"].diff()
df["upPrice"] = df["diff"].where(df["diff"] > 0, 0)
df["downPrice"] = df["diff"].where(df["diff"] < 0, 0)
del df["diff"]
print(df)
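An equivalent formulation, if you prefer one expression per column, uses clip instead of where (shown here on a small stand-in frame with the same shape as the question's data):

```python
import pandas as pd

# small stand-in frame mirroring the question's ticker/close layout
df = pd.DataFrame({
    "ticker": ["A2M", "A2M", "A2M", "AAN", "AAN"],
    "close": [17.47, 16.70, 17.02, 15.07, 15.18],
})

diff = df.groupby("ticker")["close"].diff()
df["upPrices"] = diff.clip(lower=0).fillna(0)    # keep positive moves, zero elsewhere
df["downPrices"] = diff.clip(upper=0).fillna(0)  # keep negative moves, zero elsewhere
print(df)
```

The fillna(0) handles each ticker's first row, where diff is NaN.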
I want to locate all values greater than 0 within columns n_1 to n_3 inclusive and populate them into columns new_1 to new_3 inclusive, in order of smallest to largest, such that column new_1 has the smallest value and new_3 the largest. If any columns cannot be populated because there are not enough values, fill them with 0.
EVENT_ID n_1 n_2 n_3
143419013 0.00 7.80 12.83
143419017 1.72 20.16 16.08
143419021 3.03 12.00 17.14
143419025 2.63 0.00 2.51
143419028 2.38 22.00 2.96
143419030 0.00 40.00 0.00
Expected Output:
EVENT_ID n_1 n_2 n_3 new_1 new_2 new_3
143419013 0.00 7.80 12.83 7.80 12.83 0.00
143419017 1.72 20.16 16.08 1.72 16.08 20.16
143419021 3.03 12.00 17.14 3.03 12.00 17.14
143419025 2.63 0.00 2.51 2.51 2.63 0.00
143419028 2.38 22.00 2.96 2.38 2.96 22.00
143419030 0.00 40.00 0.00 40.00 0.00 0.00
I tried using the apply function to create these new columns, but I got an error down the line.
df[['new_1','new_2','new_3']] = pivot_df.apply(lambda a,b,c: a.n_1, b.n_2, c.n_3 axis=1)
Let's subset the DataFrame and remove values that do not meet the condition with where, then use np.sort to sort across rows and fillna to replace any missing values with 0:
cols = ['n_1', 'n_2', 'n_3']
df[[f'new_{i}' for i in range(1, len(cols) + 1)]] = pd.DataFrame(
    np.sort(df[cols].where(df[cols] > 0), axis=1)
).fillna(0)
df:
EVENT_ID n_1 n_2 n_3 new_1 new_2 new_3
0 143419013 0.00 7.80 12.83 7.80 12.83 0.00
1 143419017 1.72 20.16 16.08 1.72 16.08 20.16
2 143419021 3.03 12.00 17.14 3.03 12.00 17.14
3 143419025 2.63 0.00 2.51 2.51 2.63 0.00
4 143419028 2.38 22.00 2.96 2.38 2.96 22.00
5 143419030 0.00 40.00 0.00 40.00 0.00 0.00
Setup used:
import numpy as np
import pandas as pd
df = pd.DataFrame({
    'EVENT_ID': [143419013, 143419017, 143419021, 143419025, 143419028,
                 143419030],
    'n_1': [0.0, 1.72, 3.03, 2.63, 2.38, 0.0],
    'n_2': [7.8, 20.16, 12.0, 0.0, 22.0, 40.0],
    'n_3': [12.83, 16.08, 17.14, 2.51, 2.96, 0.0]
})
df:
EVENT_ID n_1 n_2 n_3
0 143419013 0.00 7.80 12.83
1 143419017 1.72 20.16 16.08
2 143419021 3.03 12.00 17.14
3 143419025 2.63 0.00 2.51
4 143419028 2.38 22.00 2.96
5 143419030 0.00 40.00 0.00
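A side note on why the trailing fillna(0) works: np.sort pushes NaN values to the end of each row, so once where has masked the non-positive entries to NaN, sorting leaves the real values first and the NaNs (the future zeros) last. A minimal sketch:

```python
import numpy as np

# one masked row: the 0.00 entry has already become NaN via where
row = np.array([2.63, np.nan, 2.51])
out = np.sort(row)  # NaN sorts to the end of the row
print(out)
```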
I split my dataframe into two: one having null values in a specific column, and the other null-free. Then I selected a couple of metrics and tried to present them in a comparable way. I have already created the output below. However, I believe this code is not very attractive, and I would like to know if there is an easier way to get the output I wanted. If I could even get two metrics per row, that would be perfect.
dataframe1 (with null):
      CRIM    ZN  INDUS  CHAS  NOX     RM   AGE     DIS  RAD    TAX  PTRATIO
   0.02731   0.0   7.07   0.0  NaN  6.421  78.9  4.9671  2.0  242.0     17.8
   0.08829  12.5   7.87   0.0  NaN  6.012  66.6  5.5605  5.0  311.0     15.2
   0.22489  12.5   7.87   0.0  NaN  6.377  94.3  6.3467  5.0  311.0     15.2
dataframe2 (without null):
      CRIM    ZN  INDUS  CHAS    NOX     RM   AGE     DIS  RAD    TAX  PTRATIO
   0.32711   0.0   8.07   0.0  0.538  6.851  75.9  3.9371  1.0  242.0     14.2
   0.21859  16.5   8.85   0.0  0.469  6.124  61.6  6.5601  3.0  242.0     11.6
   0.42282  13.5   8.85   0.0  0.458  6.732  97.3  5.3627  3.0  311.0     18.1
print('min', 'Null Free', 'With Nulls', sep='\t')
for i in housing.columns:
    if i == 'NOX':
        pass
    else:
        print(i, "{:.2f}".format(housing_nullfree.min().T.loc[i]), "{:.2f}".format(housing_nulls.min().T.loc[i]), sep="\t\t")
print()

print('Perc25', 'Null Free', 'With Nulls', sep='\t')
for i in housing.columns:
    if i == 'NOX':
        pass
    else:
        print(i, "{:.2f}".format(np.percentile(housing_nullfree[i], 25)), "{:.2f}".format(np.percentile(housing_nulls[i], 25)), sep="\t\t")
print()

print('mean', 'Null Free', 'With Nulls', sep='\t')
for i in housing.columns:
    if i == 'NOX':
        pass
    else:
        print(i, '{:.2f}'.format(housing_nullfree.mean().T.loc[i]), '{:.2f}'.format(housing_nulls.mean().T.loc[i]), sep="\t\t")
print()

print('median', 'Null Free', 'With Nulls', sep='\t')
for i in housing.columns:
    if i == 'NOX':
        pass
    else:
        print(i, '{:.2f}'.format(np.median(housing_nullfree[i])), '{:.2f}'.format(np.median(housing_nulls[i])), sep="\t\t")
print()

print('Perc75', 'Null Free', 'With Nulls', sep='\t')
for i in housing.columns:
    if i == 'NOX':
        pass
    else:
        print(i, "{:.2f}".format(np.percentile(housing_nullfree[i], 75)), "{:.2f}".format(np.percentile(housing_nulls[i], 75)), sep="\t\t")
print()

print('Perc90', 'Null Free', 'With Nulls', sep='\t')
for i in housing.columns:
    if i == 'NOX':
        pass
    else:
        print(i, "{:.2f}".format(np.percentile(housing_nullfree[i], 90)), "{:.2f}".format(np.percentile(housing_nulls[i], 90)), sep="\t\t")
print()

print('max', 'Null Free', 'With Nulls', sep='\t')
for i in housing.columns:
    if i == 'NOX':
        pass
    else:
        print(i, "{:.2f}".format(housing_nullfree.max().T.loc[i]), "{:.2f}".format(housing_nulls.max().T.loc[i]), sep="\t\t")
my output:
min Null Free With Nulls
CRIM 0.01 0.02
ZN 0.00 0.00
INDUS 0.46 1.25
CHAS 0.00 0.00
RM 3.56 4.65
AGE 2.90 6.50
DIS 1.13 1.20
RAD 1.00 1.00
TAX 188.00 187.00
PTRATIO 12.60 13.00
B 0.32 27.49
LSTAT 1.73 2.96
NOX_cat 0.00 1.00
Perc25 Null Free With Nulls
CRIM 0.08 0.06
ZN 0.00 0.00
INDUS 5.32 4.05
CHAS 0.00 0.00
RM 5.88 6.01
AGE 45.70 42.40
DIS 2.06 2.51
RAD 4.00 4.00
TAX 281.00 276.00
PTRATIO 17.40 16.60
B 374.68 378.95
LSTAT 7.18 6.90
NOX_cat 0.00 1.00
mean Null Free With Nulls
CRIM 3.78 2.38
ZN 10.70 16.20
INDUS 11.42 9.09
CHAS 0.07 0.08
RM 6.27 6.36
AGE 68.96 65.77
DIS 3.70 4.49
RAD 9.59 9.28
TAX 410.89 388.90
PTRATIO 18.47 18.33
B 353.87 377.10
LSTAT 12.89 10.90
NOX_cat 0.00 1.00
median Null Free With Nulls
CRIM 0.29 0.15
ZN 0.00 0.00
INDUS 9.90 7.38
CHAS 0.00 0.00
RM 6.18 6.31
AGE 78.10 72.70
DIS 3.09 3.95
RAD 5.00 5.00
TAX 334.00 307.00
PTRATIO 19.10 18.60
B 391.13 392.52
LSTAT 11.66 9.45
NOX_cat 0.00 1.00
Perc75 Null Free With Nulls
CRIM 3.69 1.61
ZN 12.50 25.00
INDUS 18.10 13.92
CHAS 0.00 0.00
RM 6.59 6.79
AGE 94.30 93.30
DIS 5.10 5.72
RAD 24.00 8.00
TAX 666.00 432.00
PTRATIO 20.20 20.20
B 396.24 395.59
LSTAT 17.12 14.10
NOX_cat 0.00 1.00
Perc90 Null Free With Nulls
CRIM 11.00 7.99
ZN 40.00 70.00
INDUS 19.58 18.10
CHAS 0.00 0.00
RM 7.16 7.04
AGE 98.80 97.40
DIS 6.75 7.83
RAD 24.00 24.00
TAX 666.00 666.00
PTRATIO 20.90 20.20
B 396.90 396.90
LSTAT 23.28 18.80
NOX_cat 0.00 1.00
max Null Free With Nulls
CRIM 88.98 24.39
ZN 100.00 90.00
INDUS 27.74 18.10
CHAS 1.00 1.00
RM 8.78 8.25
AGE 100.00 100.00
DIS 10.71 12.13
RAD 24.00 24.00
TAX 711.00 666.00
PTRATIO 22.00 22.00
B 396.90 396.90
LSTAT 37.97 28.28
NOX_cat 0.00 1.00
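A much more compact way to build the same comparison (a sketch, assuming the two splits are named housing_nullfree and housing_nulls as in the code above) is to let describe compute all the statistics at once and concatenate the two summaries side by side:

```python
import pandas as pd

# tiny stand-in frames; in the question these would be the real housing splits
housing_nullfree = pd.DataFrame({"CRIM": [0.01, 3.78, 88.98], "ZN": [0.0, 10.7, 100.0]})
housing_nulls = pd.DataFrame({"CRIM": [0.02, 2.38, 24.39], "ZN": [0.0, 16.2, 90.0]})

pcts = [0.25, 0.5, 0.75, 0.9]
summary = pd.concat(
    [housing_nullfree.describe(percentiles=pcts),
     housing_nulls.describe(percentiles=pcts)],
    axis=1, keys=["Null Free", "With Nulls"],
).round(2)
print(summary)
```

This yields one table with min/mean/percentiles/max per column for both frames, and dropping NOX or transposing for "two metrics per row" is then a one-liner on summary.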
From the following dataframe:
dim_0 dim_1
0 0 40.54 23.40 6.70 1.70 1.82 0.96 1.62
1 175.89 20.24 7.78 1.55 1.45 0.80 1.44
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1 0 21.38 24.00 5.90 1.60 2.55 1.50 2.36
1 130.29 18.40 8.49 1.52 1.45 0.80 1.47
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2 0 6.30 25.70 5.60 1.70 2.16 1.16 1.87
1 73.45 21.49 6.88 1.61 1.61 0.94 1.63
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
3 0 16.64 25.70 5.70 1.60 2.17 1.12 1.76
1 125.89 19.10 7.52 1.43 1.44 0.78 1.40
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
4 0 41.38 24.70 5.60 1.50 2.08 1.16 1.85
1 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5 0 180.59 16.40 3.80 1.10 4.63 3.86 5.71
1 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
6 0 13.59 24.40 6.10 1.70 2.62 1.51 2.36
1 103.19 19.02 8.70 1.53 1.48 0.76 1.38
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
7 0 3.15 24.70 5.60 1.50 2.14 1.22 2.00
1 55.90 23.10 6.07 1.50 1.86 1.12 1.87
2 208.04 20.39 6.82 1.35 1.47 0.95 1.67
How can I get only the rows from dim_1 that match the array [1 0 0 1 2 0 1 2]?
Desired result is:
0 175.89 20.24 7.78 1.55 1.45 0.80 1.44
1 21.38 24.00 5.90 1.60 2.55 1.50 2.36
2 6.30 25.70 5.60 1.70 2.16 1.16 1.87
3 125.89 19.10 7.52 1.43 1.44 0.78 1.40
4 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5 180.59 16.40 3.80 1.10 4.63 3.86 5.71
6 103.19 19.02 8.70 1.53 1.48 0.76 1.38
7 208.04 20.39 6.82 1.35 1.47 0.95 1.67
I've tried using slicing, cross-sections, etc., but with no success.
Thanks in advance for the help.
Use MultiIndex.from_arrays and select by DataFrame.loc:
arr = np.array([1, 0, 0, 1, 2, 0, 1 ,2])
df = df.loc[pd.MultiIndex.from_arrays([df.index.levels[0], arr])]
print (df)
2 3 4 5 6 7 8
0
0 1 175.89 20.24 7.78 1.55 1.45 0.80 1.44
1 0 21.38 24.00 5.90 1.60 2.55 1.50 2.36
2 0 6.30 25.70 5.60 1.70 2.16 1.16 1.87
3 1 125.89 19.10 7.52 1.43 1.44 0.78 1.40
4 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5 0 180.59 16.40 3.80 1.10 4.63 3.86 5.71
6 1 103.19 19.02 8.70 1.53 1.48 0.76 1.38
7 2 208.04 20.39 6.82 1.35 1.47 0.95 1.67
arr = np.array([1, 0, 0, 1, 2, 0, 1 ,2])
df = df.loc[pd.MultiIndex.from_arrays([df.index.levels[0], arr])].droplevel(1)
print (df)
2 3 4 5 6 7 8
0
0 175.89 20.24 7.78 1.55 1.45 0.80 1.44
1 21.38 24.00 5.90 1.60 2.55 1.50 2.36
2 6.30 25.70 5.60 1.70 2.16 1.16 1.87
3 125.89 19.10 7.52 1.43 1.44 0.78 1.40
4 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5 180.59 16.40 3.80 1.10 4.63 3.86 5.71
6 103.19 19.02 8.70 1.53 1.48 0.76 1.38
7 208.04 20.39 6.82 1.35 1.47 0.95 1.67
I'd go with advanced indexing using NumPy:
l = [1, 0, 0, 1, 2, 0, 1, 2]
i, j = df.index.levels
# flat row position = chosen sub-row + group number * group size
ix = np.array(l) + np.arange(i.max() + 1) * (j.max() + 1)
pd.DataFrame(df.to_numpy()[ix])
0 1 2 3 4 5 6
0 175.89 20.24 7.78 1.55 1.45 0.80 1.44
1 21.38 24.00 5.90 1.60 2.55 1.50 2.36
2 6.30 25.70 5.60 1.70 2.16 1.16 1.87
3 125.89 19.10 7.52 1.43 1.44 0.78 1.40
4 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5 180.59 16.40 3.80 1.10 4.63 3.86 5.71
6 103.19 19.02 8.70 1.53 1.48 0.76 1.38
7 208.04 20.39 6.82 1.35 1.47 0.95 1.67
Try the following code:
mask_array = [1, 0, 0, 1, 2, 0, 1, 2]
df_first = pd.DataFrame()  # <your first dataframe>
new_array = df_first[df_first['dim_1'].isin(mask_array)]
Python code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas_datareader import data as wb
stock='3988.HK'
df = wb.DataReader(stock,data_source='yahoo',start='2018-07-01')
rsi_period = 14
chg = df['Close'].diff(1)
gain = chg.mask(chg<0,0)
df['Gain'] = gain
loss = chg.mask(chg>0,0)
df['Loss'] = loss
avg_gain = gain.ewm(com = rsi_period-1,min_periods=rsi_period).mean()
avg_loss = loss.ewm(com = rsi_period-1,min_periods=rsi_period).mean()
df['Avg Gain'] = avg_gain
df['Avg Loss'] = avg_loss
rs = abs(avg_gain/avg_loss)
rsi = 100-(100/(1+rs))
df['RSI'] = rsi
df.reset_index(inplace=True)
df
Output:
Date High Low Open Close Volume Adj Close Gain Loss Avg Gain Avg Loss RSI
0 2018-07-03 3.87 3.76 3.83 3.84 684899302.0 3.629538 NaN NaN NaN NaN NaN
1 2018-07-04 3.91 3.84 3.86 3.86 460325574.0 3.648442 0.02 0.00 NaN NaN NaN
2 2018-07-05 3.70 3.62 3.68 3.68 292810499.0 3.680000 0.00 -0.18 NaN NaN NaN
3 2018-07-06 3.72 3.61 3.69 3.67 343653088.0 3.670000 0.00 -0.01 NaN NaN NaN
4 2018-07-09 3.75 3.68 3.70 3.69 424596186.0 3.690000 0.02 0.00 NaN NaN NaN
5 2018-07-10 3.74 3.70 3.71 3.71 327048051.0 3.710000 0.02 0.00 NaN NaN NaN
6 2018-07-11 3.65 3.61 3.63 3.64 371355401.0 3.640000 0.00 -0.07 NaN NaN NaN
7 2018-07-12 3.69 3.63 3.66 3.66 309888328.0 3.660000 0.02 0.00 NaN NaN NaN
8 2018-07-13 3.69 3.62 3.69 3.63 261928758.0 3.630000 0.00 -0.03 NaN NaN NaN
9 2018-07-16 3.63 3.57 3.61 3.62 306970074.0 3.620000 0.00 -0.01 NaN NaN NaN
10 2018-07-17 3.62 3.56 3.62 3.58 310294921.0 3.580000 0.00 -0.04 NaN NaN NaN
11 2018-07-18 3.61 3.55 3.58 3.58 334592695.0 3.580000 0.00 0.00 NaN NaN NaN
12 2018-07-19 3.61 3.56 3.61 3.56 211984563.0 3.560000 0.00 -0.02 NaN NaN NaN
13 2018-07-20 3.64 3.52 3.57 3.61 347506394.0 3.610000 0.05 0.00 NaN NaN NaN
14 2018-07-23 3.65 3.57 3.59 3.62 313125328.0 3.620000 0.01 0.00 0.010594 -0.021042 33.487100
15 2018-07-24 3.71 3.60 3.60 3.68 367627204.0 3.680000 0.06 0.00 0.015854 -0.018802 45.745967
16 2018-07-25 3.73 3.68 3.72 3.69 270460990.0 3.690000 0.01 0.00 0.015252 -0.016868 47.483263
17 2018-07-26 3.73 3.66 3.72 3.69 234388072.0 3.690000 0.00 0.00 0.013731 -0.015186 47.483263
18 2018-07-27 3.70 3.66 3.68 3.69 190039532.0 3.690000 0.00 0.00 0.012399 -0.013713 47.483263
19 2018-07-30 3.72 3.67 3.68 3.70 163971848.0 3.700000 0.01 0.00 0.012172 -0.012417 49.502851
20 2018-07-31 3.70 3.66 3.67 3.68 168486023.0 3.680000 0.00 -0.02 0.011047 -0.013118 45.716244
21 2018-08-01 3.72 3.66 3.71 3.68 199801191.0 3.680000 0.00 0.00 0.010047 -0.011930 45.716244
22 2018-08-02 3.68 3.59 3.66 3.61 307920738.0 3.610000 0.00 -0.07 0.009155 -0.017088 34.884632
23 2018-08-03 3.62 3.57 3.59 3.61 184816985.0 3.610000 0.00 0.00 0.008356 -0.015596 34.884632
24 2018-08-06 3.66 3.60 3.62 3.61 189696153.0 3.610000 0.00 0.00 0.007637 -0.014256 34.884632
25 2018-08-07 3.66 3.61 3.63 3.65 216157642.0 3.650000 0.04 0.00 0.010379 -0.013048 44.302922
26 2018-08-08 3.66 3.61 3.65 3.63 215365540.0 3.630000 0.00 -0.02 0.009511 -0.013629 41.101805
27 2018-08-09 3.66 3.59 3.59 3.65 230275455.0 3.650000 0.02 0.00 0.010378 -0.012504 45.353992
28 2018-08-10 3.66 3.60 3.65 3.62 219157328.0 3.620000 0.00 -0.03 0.009530 -0.013933 40.617049
29 2018-08-13 3.59 3.54 3.58 3.56 270620120.0 3.560000 0.00 -0.06 0.008759 -0.017658 33.158019
In this case, I want to create a new column 'max close within 14 trade days':
'max close within 14 trade days' = maximum 'close' within the next 14 days.
For example, in row 0 the data range should be from row 1 to row 15, so
'max close within 14 trade days' = 3.86
You can do the following:
# convert to datetime
df['Date'] = pd.to_datetime(df['Date'])
# max of 'Close' within each 14-calendar-day bin, mapped back onto every row
df['max_close_within_14_trade_days'] = df['Date'].map(
    df.groupby(pd.Grouper(key='Date', freq='14D'))['Close'].max())
# fill missing values with the previous value
df['max_close_within_14_trade_days'].fillna(method='ffill', inplace=True)
Date High Low Open Close max_close_within_14_trade_days
0 0 2018-07-03 3.87 3.76 3.83 3.86
1 1 2018-07-04 3.91 3.84 3.86 3.86
2 2 2018-07-05 3.70 3.62 3.68 3.86
3 3 2018-07-06 3.72 3.61 3.69 3.86
4 4 2018-07-09 3.75 3.68 3.70 3.86
Other solution:
df['max_close_within_14_trade_days'] = [df.loc[x+1:x+14,'Close'].max() for x in range(0, df.shape[0])]
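A vectorized alternative to the list comprehension (a sketch, not from the answers above; requires pandas >= 1.1): FixedForwardWindowIndexer gives rolling a forward-looking window, and shifting by one first makes that window cover the next 14 rows rather than starting at the current one:

```python
import pandas as pd
from pandas.api.indexers import FixedForwardWindowIndexer

# small stand-in for df['Close']
close = pd.Series([3.84, 3.86, 3.68, 3.67, 3.69, 3.71, 3.64])

window = FixedForwardWindowIndexer(window_size=14)
# shift(-1) so the forward window starts at the *next* row
max_next_14 = close.shift(-1).rolling(window, min_periods=1).max()
print(max_next_14)
```

For row 0 this reproduces the question's example (max of the following closes is 3.86); the last row has no following close, so it stays NaN.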
I want to extract a row by name from the following dataframe:
Unnamed: 1 1 1.1 2 TOT
0
1 DEPTH(m) 0.01 1.24 1.52 NaN
2 BD 33kpa(t/m3) 1.60 1.60 1.60 NaN
3 SAND(%) 42.10 42.10 65.10 NaN
4 SILT(%) 37.90 37.90 16.90 NaN
5 CLAY(%) 20.00 20.00 18.00 NaN
6 ROCK(%) 12.00 12.00 12.00 NaN
7 WLS(kg/ha) 2.60 8.20 0.10 10.9
8 WLM(kg/ha) 5.00 8.30 0.00 13.4
9 WLSL(kg/ha) 0.00 3.80 0.10 3.9
10 WLSC(kg/ha) 1.10 3.50 0.00 4.6
11 WLMC(kg/ha) 2.10 3.50 0.00 5.6
12 WLSLC(kg/ha) 0.00 1.60 0.00 1.6
13 WLSLNC(kg/ha) 1.10 1.80 0.00 2.9
14 WBMC(kg/ha) 3.40 835.10 195.20 1033.7
15 WHSC(kg/ha) 66.00 8462.00 1924.00 10451.0
16 WHPC(kg/ha) 146.00 18020.00 4102.00 22269.0
17 WOC(kg/ha) 219.00 27324.00 6221.00 34.0
18 WLSN(kg/ha) 0.00 0.00 0.00 0.0
19 WLMN(kg/ha) 0.00 0.10 0.00 0.1
20 WBMN(kg/ha) 0.50 92.60 19.30 112.5
21 WHSN(kg/ha) 7.00 843.00 191.00 1041.0
22 WHPN(kg/ha) 15.00 1802.00 410.00 2227.0
23 WON(kg/ha) 22.00 2738.00 621.00 3381.0
I want to extract the row containing info on WOC(kg/ha). Here is what I am doing:
df.loc['WOC(kg/ha)']
but I get the error:
*** KeyError: 'the label [WOC(kg/ha)] is not in the [index]'
You don't have that label in your index; it's in your first column, so the following should work:
df.loc[df['Unnamed: 1'] == 'WOC(kg/ha)']
otherwise set the index to that column and your code would work fine:
df.set_index('Unnamed: 1', inplace=True)
Also, this can be used to set the index without explicitly specifying the column name: df.set_index(df.columns[0], inplace=True)
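Put together, a minimal runnable sketch of both options (with hypothetical values standing in for the real table):

```python
import pandas as pd

# toy frame mimicking the layout above (values are placeholders)
df = pd.DataFrame({
    "Unnamed: 1": ["DEPTH(m)", "WOC(kg/ha)"],
    "1": [0.01, 219.0],
    "TOT": [float("nan"), 34.0],
})

# option 1: boolean mask, no index change needed
row = df.loc[df["Unnamed: 1"] == "WOC(kg/ha)"]

# option 2: make the first column the index, then label lookup works
df2 = df.set_index("Unnamed: 1")
row2 = df2.loc["WOC(kg/ha)"]
print(row2)
```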