Iterating over a dataframe with functions - python

If I have a dataframe like this:
NEG_00_04 NEG_04_08 NEG_08_12 NEG_12_16 NEG_16_20 NEG_20_24
datum_von
2017-10-12 21.69 15.36 0.87 1.42 0.76 0.65
2017-10-13 11.85 8.08 1.39 2.86 1.02 0.55
2017-10-14 7.83 5.88 1.87 2.04 2.29 2.18
2017-10-15 14.64 11.28 2.62 3.35 2.13 1.25
2017-10-16 5.11 5.82 -0.30 -0.38 -0.24 -0.10
2017-10-17 12.09 9.61 0.20 1.09 0.39 0.57
And I want to keep the values that are above 0 and change them to zero when they are lower.
I'm not sure how I should use iterrows() and loc() to do so.

you can try:
df1 = df[df > 0].fillna(0)
As a result:
In [24]: df
Out[24]:
    datum_von  NEG_00_04  NEG_04_08  NEG_08_12  NEG_12_16  NEG_16_20  NEG_20_24
0  2017-10-12      21.69      15.36       0.87       1.42       0.76       0.65
1  2017-10-13      11.85       8.08       1.39       2.86       1.02       0.55
2  2017-10-14       7.83       5.88       1.87       2.04       2.29       2.18
3  2017-10-15      14.64      11.28       2.62       3.35       2.13       1.25
4  2017-10-16       5.11       5.82      -0.30      -0.38      -0.24      -0.10
5  2017-10-17      12.09       9.61       0.20       1.09       0.39       0.57
In [25]: df1 = df[df > 0].fillna(0)
In [26]: df1
Out[26]:
    datum_von  NEG_00_04  NEG_04_08  NEG_08_12  NEG_12_16  NEG_16_20  NEG_20_24
0  2017-10-12      21.69      15.36       0.87       1.42       0.76       0.65
1  2017-10-13      11.85       8.08       1.39       2.86       1.02       0.55
2  2017-10-14       7.83       5.88       1.87       2.04       2.29       2.18
3  2017-10-15      14.64      11.28       2.62       3.35       2.13       1.25
4  2017-10-16       5.11       5.82       0.00       0.00       0.00       0.00
5  2017-10-17      12.09       9.61       0.20       1.09       0.39       0.57

clip_lower and mask solutions are good.
Here is another one with applymap:
df.applymap(lambda x: max(0.0, x))
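A side note for newer pandas: DataFrame.applymap was deprecated in pandas 2.1 in favor of the element-wise DataFrame.map, so on recent versions the same idea reads:
df.map(lambda x: max(0.0, x))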

Use clip_lower:
df = df.clip_lower(0)
print (df)
NEG_00_04 NEG_04_08 NEG_08_12 NEG_12_16 NEG_16_20 NEG_20_24
datum_von
2017-10-12 21.69 15.36 0.87 1.42 0.76 0.65
2017-10-13 11.85 8.08 1.39 2.86 1.02 0.55
2017-10-14 7.83 5.88 1.87 2.04 2.29 2.18
2017-10-15 14.64 11.28 2.62 3.35 2.13 1.25
2017-10-16 5.11 5.82 0.00 0.00 0.00 0.00
2017-10-17 12.09 9.61 0.20 1.09 0.39 0.57
If the first column is not the index:
df = df.set_index('datum_von').clip_lower(0)
print (df)
NEG_00_04 NEG_04_08 NEG_08_12 NEG_12_16 NEG_16_20 NEG_20_24
datum_von
2017-10-12 21.69 15.36 0.87 1.42 0.76 0.65
2017-10-13 11.85 8.08 1.39 2.86 1.02 0.55
2017-10-14 7.83 5.88 1.87 2.04 2.29 2.18
2017-10-15 14.64 11.28 2.62 3.35 2.13 1.25
2017-10-16 5.11 5.82 0.00 0.00 0.00 0.00
2017-10-17 12.09 9.61 0.20 1.09 0.39 0.57
Alternative solution:
df = df.mask(df < 0, 0)
print (df)
NEG_00_04 NEG_04_08 NEG_08_12 NEG_12_16 NEG_16_20 NEG_20_24
datum_von
2017-10-12 21.69 15.36 0.87 1.42 0.76 0.65
2017-10-13 11.85 8.08 1.39 2.86 1.02 0.55
2017-10-14 7.83 5.88 1.87 2.04 2.29 2.18
2017-10-15 14.64 11.28 2.62 3.35 2.13 1.25
2017-10-16 5.11 5.82 0.00 0.00 0.00 0.00
2017-10-17 12.09 9.61 0.20 1.09 0.39 0.57
Timings:
df = pd.concat([df]*10000).reset_index(drop=True)
In [240]: %timeit (df.applymap(lambda x: max(0.0, x)))
10 loops, best of 3: 164 ms per loop
In [241]: %timeit (df[df > 0].fillna(0))
100 loops, best of 3: 7.05 ms per loop
In [242]: %timeit (df.clip_lower(0))
1000 loops, best of 3: 1.96 ms per loop
In [243]: %timeit df.mask(df < 0, 0)
100 loops, best of 3: 5.18 ms per loop
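One caveat for newer pandas: clip_lower was deprecated in 0.24 and removed in 1.0, so on current versions the equivalent of the fastest solution above is clip with the lower keyword:
df = df.clip(lower=0)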


pd.concat ends up with ValueError: no types given

I'm trying to merge two dfs (basically the same df at different times) using pd.concat.
Here is my code:
Aujourdhui = datetime.datetime.now()
Aujourdhui = (Aujourdhui.strftime("%X"))
PerfsL1 = pd.read_html('https://fbref.com/fr/comps/13/stats/Statistiques-Ligue-1#all_stats_standard', header=1)[0]
PerfsL1.columns = ['Équipe', 'Used_players', 'age', 'Possesion', "nb_matchs", "Starts", "Min",
'90s','Buts','Assists', 'No_penaltis', 'Penaltis', 'Penaltis_tentes',
'Cartons_jaunes', 'Cartons_rouges', 'Buts/90mn','Assists/90mn', 'B+A /90mn',
'NoPenaltis/90mn', 'B+A+P/90mn','Exp_buts','Exp_NoPenaltis', 'Exp_Assists', 'Exp_NP+A',
'Exp_buts/90mn', 'Exp_Assists/90mn','Exp_B+A/90mn','Exp_NoPenaltis/90mn', 'Exp_NP+A/90mn']
PerfsL1.insert(0, "Date", Aujourdhui)
print(PerfsL1)
PerfsL12 = pd.read_csv('Ligue_1_Perfs.csv', index_col=0)
print(PerfsL12)
PerfsL1 = pd.concat([PerfsL1, PerfsL12], ignore_index = True)
print (PerfsL1)
I successfully managed to build both dfs individually, and they share the same columns, but I can't concatenate them; I get
ValueError: no types given.
Do you have an idea where it could be coming from?
EDIT
Here are both dataframes:
'Ligue_1.csv'
Date Équipe Used_players age Possesion nb_matchs ... Exp_NP+A Exp_buts/90mn Exp_Assists/90mn Exp_B+A/90mn Exp_NoPenaltis/90mn Exp_NP+A/90mn
0 00:37:48 Ajaccio 18 29.1 34.5 2 ... 1.6 0.97 0.24 1.20 0.57 0.81
1 00:37:48 Angers 18 26.8 55.0 2 ... 5.9 1.78 1.18 2.96 1.78 2.96
2 00:37:48 Auxerre 15 29.4 39.5 2 ... 3.3 0.83 0.80 1.63 0.83 1.63
3 00:37:48 Brest 18 26.8 42.5 2 ... 5.0 1.67 1.23 2.90 1.28 2.51
4 00:37:48 Clermont Foot 18 27.8 48.5 2 ... 1.8 0.89 0.38 1.27 0.50 0.88
5 00:37:48 Lens 16 26.2 63.0 2 ... 5.6 1.92 1.29 3.21 1.53 2.82
6 00:37:48 Lille 18 27.2 65.0 2 ... 7.3 2.02 1.65 3.66 2.02 3.66
7 00:37:48 Lorient 14 25.8 36.0 1 ... 0.6 0.37 0.26 0.63 0.37 0.63
8 00:37:48 Lyon 15 26.0 68.0 1 ... 1.2 1.52 0.49 2.00 0.73 1.22
9 00:37:48 Marseille 17 26.9 55.0 2 ... 4.9 1.40 1.03 2.43 1.40 2.43
10 00:37:48 Monaco 19 24.8 40.5 2 ... 7.1 2.74 1.19 3.93 2.35 3.54
11 00:37:48 Montpellier 19 25.5 47.5 2 ... 3.2 0.93 0.66 1.59 0.93 1.59
12 00:37:48 Nantes 16 26.9 40.5 2 ... 3.9 1.37 0.60 1.97 1.37 1.97
13 00:37:48 Nice 18 25.9 54.0 2 ... 3.1 1.25 0.69 1.94 0.86 1.55
14 00:37:48 Paris S-G 18 27.6 60.0 2 ... 8.1 3.05 1.76 4.81 2.27 4.03
And PerfsL1, freshly scraped with pd.read_html as above (print(PerfsL1)):
Date Équipe Used_players age Possesion nb_matchs ... Exp_NP+A Exp_buts/90mn Exp_Assists/90mn Exp_B+A/90mn Exp_NoPenaltis/90mn Exp_NP+A/90mn
0 09:56:18 Ajaccio 18 29.1 34.5 2 ... 1.6 0.97 0.24 1.20 0.57 0.81
1 09:56:18 Angers 18 26.8 55.0 2 ... 5.9 1.78 1.18 2.96 1.78 2.96
2 09:56:18 Auxerre 15 29.4 39.5 2 ... 3.3 0.83 0.80 1.63 0.83 1.63
3 09:56:18 Brest 18 26.8 42.5 2 ... 5.0 1.67 1.23 2.90 1.28 2.51
4 09:56:18 Clermont Foot 18 27.8 48.5 2 ... 1.8 0.89 0.38 1.27 0.50 0.88
5 09:56:18 Lens 16 26.2 63.0 2 ... 5.6 1.92 1.29 3.21 1.53 2.82
6 09:56:18 Lille 18 27.2 65.0 2 ... 7.3 2.02 1.65 3.66 2.02 3.66
7 09:56:18 Lorient 14 25.8 36.0 1 ... 0.6 0.37 0.26 0.63 0.37 0.63
8 09:56:18 Lyon 15 26.0 68.0 1 ... 1.2 1.52 0.49 2.00 0.73 1.22
9 09:56:18 Marseille 17 26.9 55.0 2 ... 4.9 1.40 1.03 2.43 1.40 2.43
10 09:56:18 Monaco 19 24.8 40.5 2 ... 7.1 2.74 1.19 3.93 2.35 3.54
11 09:56:18 Montpellier 19 25.5 47.5 2 ... 3.2 0.93 0.66 1.59 0.93 1.59
12 09:56:18 Nantes 16 26.9 40.5 2 ... 3.9 1.37 0.60 1.97 1.37 1.97
13 09:56:18 Nice 18 25.9 54.0 2 ... 3.1 1.25 0.69 1.94 0.86 1.55
Thank you for your support and have a great day!
Your code should work.
Nevertheless, try this before the concat:
PerfsL1["Date"] = pd.to_datetime(PerfsL1["Date"], format="%X", errors=‘coerce’)
I finally managed to concat both tables.
The solution was to round-trip both through CSV first:
table1 = pd.read_html('http://.......1........com')[0]  # read_html returns a list of tables
table1.to_csv('C://.....1........')
table1 = pd.read_csv('C://.....1........')
table2 = pd.read_html('http://.......2........com')[0]
table2.to_csv('C://.....2........')
table2 = pd.read_csv('C://.....2........')
x = pd.concat([table2, table1])
And now it works perfectly!
Thanks for your help!
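For what it's worth, the CSV round trip likely works because read_csv re-infers the column dtypes, so both frames end up with matching types. A minimal sketch of aligning the dtypes directly, without the intermediate files (an assumption on my part, and it requires the columns to match, as they do here):
PerfsL12 = pd.read_csv('Ligue_1_Perfs.csv', index_col=0)
# cast the freshly scraped frame to the CSV frame's dtypes before concatenating
PerfsL1 = PerfsL1.astype(PerfsL12.dtypes.to_dict())
PerfsL1 = pd.concat([PerfsL1, PerfsL12], ignore_index=True)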

Get data from Multi-index dataframe based on numpy array

From the following dataframe:
dim_0 dim_1
0 0 40.54 23.40 6.70 1.70 1.82 0.96 1.62
1 175.89 20.24 7.78 1.55 1.45 0.80 1.44
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
1 0 21.38 24.00 5.90 1.60 2.55 1.50 2.36
1 130.29 18.40 8.49 1.52 1.45 0.80 1.47
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2 0 6.30 25.70 5.60 1.70 2.16 1.16 1.87
1 73.45 21.49 6.88 1.61 1.61 0.94 1.63
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
3 0 16.64 25.70 5.70 1.60 2.17 1.12 1.76
1 125.89 19.10 7.52 1.43 1.44 0.78 1.40
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
4 0 41.38 24.70 5.60 1.50 2.08 1.16 1.85
1 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5 0 180.59 16.40 3.80 1.10 4.63 3.86 5.71
1 0.00 0.00 0.00 0.00 0.00 0.00 0.00
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
6 0 13.59 24.40 6.10 1.70 2.62 1.51 2.36
1 103.19 19.02 8.70 1.53 1.48 0.76 1.38
2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
7 0 3.15 24.70 5.60 1.50 2.14 1.22 2.00
1 55.90 23.10 6.07 1.50 1.86 1.12 1.87
2 208.04 20.39 6.82 1.35 1.47 0.95 1.67
How can I get only the rows from dim_1 that match the array [1 0 0 1 2 0 1 2]?
Desired result is:
0 175.89 20.24 7.78 1.55 1.45 0.80 1.44
1 21.38 24.00 5.90 1.60 2.55 1.50 2.36
2 6.30 25.70 5.60 1.70 2.16 1.16 1.87
3 125.89 19.10 7.52 1.43 1.44 0.78 1.40
4 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5 180.59 16.40 3.80 1.10 4.63 3.86 5.71
6 103.19 19.02 8.70 1.53 1.48 0.76 1.38
7 208.04 20.39 6.82 1.35 1.47 0.95 1.67
I've tried using slicing, cross-section, etc but no success.
Thanks in advance for the help.
Use MultiIndex.from_arrays and select by DataFrame.loc:
arr = np.array([1, 0, 0, 1, 2, 0, 1, 2])
# pair each dim_0 label with the wanted dim_1 position and select those index tuples
df = df.loc[pd.MultiIndex.from_arrays([df.index.levels[0], arr])]
print (df)
2 3 4 5 6 7 8
0
0 1 175.89 20.24 7.78 1.55 1.45 0.80 1.44
1 0 21.38 24.00 5.90 1.60 2.55 1.50 2.36
2 0 6.30 25.70 5.60 1.70 2.16 1.16 1.87
3 1 125.89 19.10 7.52 1.43 1.44 0.78 1.40
4 2 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5 0 180.59 16.40 3.80 1.10 4.63 3.86 5.71
6 1 103.19 19.02 8.70 1.53 1.48 0.76 1.38
7 2 208.04 20.39 6.82 1.35 1.47 0.95 1.67
arr = np.array([1, 0, 0, 1, 2, 0, 1, 2])
# same selection, then drop the dim_1 level so only dim_0 remains in the index
df = df.loc[pd.MultiIndex.from_arrays([df.index.levels[0], arr])].droplevel(1)
print (df)
2 3 4 5 6 7 8
0
0 175.89 20.24 7.78 1.55 1.45 0.80 1.44
1 21.38 24.00 5.90 1.60 2.55 1.50 2.36
2 6.30 25.70 5.60 1.70 2.16 1.16 1.87
3 125.89 19.10 7.52 1.43 1.44 0.78 1.40
4 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5 180.59 16.40 3.80 1.10 4.63 3.86 5.71
6 103.19 19.02 8.70 1.53 1.48 0.76 1.38
7 208.04 20.39 6.82 1.35 1.47 0.95 1.67
I'd go with advanced indexing using Numpy:
l = [1, 0, 0, 1, 2, 0, 1, 2]
i, j = df.index.levels
# flat row position = dim_0 * (rows per dim_0 group) + dim_1
ix = np.array(l) + np.arange(i.max() + 1) * (j.max() + 1)
pd.DataFrame(df.to_numpy()[ix])
0 1 2 3 4 5 6
0 175.89 20.24 7.78 1.55 1.45 0.80 1.44
1 21.38 24.00 5.90 1.60 2.55 1.50 2.36
2 6.30 25.70 5.60 1.70 2.16 1.16 1.87
3 125.89 19.10 7.52 1.43 1.44 0.78 1.40
4 0.00 0.00 0.00 0.00 0.00 0.00 0.00
5 180.59 16.40 3.80 1.10 4.63 3.86 5.71
6 103.19 19.02 8.70 1.53 1.48 0.76 1.38
7 208.04 20.39 6.82 1.35 1.47 0.95 1.67
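Note that this position arithmetic assumes a complete, regular MultiIndex (every dim_0 group has exactly j.max()+1 rows). Under the same assumption, a reshape-based sketch of the identical lookup:
import numpy as np
import pandas as pd
# view the 24x7 block as (dim_0, dim_1, columns) and pick one dim_1 row per dim_0 group
vals = df.to_numpy().reshape(8, 3, -1)
out = pd.DataFrame(vals[np.arange(8), np.array([1, 0, 0, 1, 2, 0, 1, 2])])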
Try the following code:
mask_array = [1, 0, 0, 1, 2, 0, 1, 2]
df_first = pd.DataFrame()  # <your first dataframe>
new_array = df_first[df_first['dim_1'].isin(mask_array)]
Note that isin keeps every row whose dim_1 value appears in the list; it does not pair one position per dim_0 group the way the solutions above do.

Converting rows from one pandas dataframe column to multiple columns without numeric data? [duplicate]

I have a DataFrame that looks like this:
exec ms tp lu ru
0 exec1 16.0 240.87 2.30 0.85
1 exec1 16.0 243.72 2.35 0.84
2 exec1 16.0 234.16 2.38 0.92
3 exec1 16.0 244.71 2.35 0.84
4 exec1 16.0 240.74 2.39 0.90
5 exec1 128.0 1686.78 2.09 0.69
6 exec1 128.0 1704.36 2.00 0.44
7 exec1 128.0 1686.45 2.07 0.60
8 exec1 128.0 1722.61 2.07 0.45
9 exec1 128.0 1726.15 2.08 0.50
10 exec1 1024.0 5754.92 2.23 0.93
11 exec1 1024.0 5740.71 2.24 0.93
12 exec1 1024.0 5751.58 2.24 0.96
13 exec1 1024.0 5819.63 2.23 0.92
14 exec1 1024.0 5797.03 2.22 0.96
15 exec1 8192.0 37833.45 1.91 3.87
16 exec1 8192.0 38154.95 2.00 3.87
17 exec1 8192.0 38178.19 2.02 3.85
18 exec1 8192.0 38152.86 1.95 3.84
19 exec1 8192.0 35209.98 1.80 3.65
20 exec1 16384.0 38109.76 1.81 3.84
21 exec1 16384.0 38059.07 1.76 3.90
22 exec1 16384.0 36683.24 1.54 3.71
23 exec1 16384.0 37908.00 1.73 3.85
24 exec1 16384.0 37014.79 1.71 3.75
and I would like to make columns from ms for the data from tp, lu and ru, have them as hierarchical columns, and use exec as the index, like this:
lu ru tp
exec 16.0 128.0 1024.0 8192.0 16384.0 16.0 128.0 1024.0 8192.0 16384.0 16.0 128.0 1024.0 8192.0 16384.0
exec1 2.30 2.09 2.23 1.91 1.81 0.85 0.69 0.93 3.87 3.84 240.87 1686.78 5754.92 37833.45 38109.76
exec1 2.35 2.00 2.24 2.00 1.76 0.84 0.44 0.93 3.87 3.90 243.72 1704.36 5740.71 38154.95 38059.07
exec1 2.38 2.07 2.24 2.02 1.54 0.92 0.60 0.96 3.85 3.71 234.16 1686.45 5751.58 38178.19 36683.24
exec1 2.35 2.07 2.23 1.95 1.73 0.84 0.45 0.92 3.84 3.85 244.71 1722.61 5819.63 38152.86 37908.00
exec1 2.39 2.08 2.22 1.80 1.71 0.90 0.50 0.96 3.65 3.75 240.74 1726.15 5797.03 35209.98 37014.79
I tried using pd.pivot_table but it creates unwanted NaNs.
You may need groupby + cumcount to create an additional key, then do the pivot transform; here I am using unstack (if you need it, check pivot as well):
df.assign(key=df.groupby(['exec','ms']).cumcount()).set_index(['exec','ms','key']).unstack([1])
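A slightly expanded sketch of the same chain, also dropping the helper key so the index matches the desired layout (assuming the column names from the question):
out = (df.assign(key=df.groupby(['exec', 'ms']).cumcount())  # nth occurrence within each (exec, ms)
         .set_index(['exec', 'ms', 'key'])
         .unstack('ms')        # ms values become the inner column level
         .droplevel('key')     # keep exec alone as the index
         .sort_index(axis=1))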

Pandas: Sum current and following cell in Pandas DataFrame

Python code
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from pandas_datareader import data as wb
stock='3988.HK'
df = wb.DataReader(stock,data_source='yahoo',start='2018-07-01')
rsi_period = 14
chg = df['Close'].diff(1)      # day-over-day change
gain = chg.mask(chg < 0, 0)    # keep positive moves only
df['Gain'] = gain
loss = chg.mask(chg > 0, 0)    # keep negative moves only
df['Loss'] = loss
# Wilder-style smoothing via exponential weighting
avg_gain = gain.ewm(com=rsi_period - 1, min_periods=rsi_period).mean()
avg_loss = loss.ewm(com=rsi_period - 1, min_periods=rsi_period).mean()
df['Avg Gain'] = avg_gain
df['Avg Loss'] = avg_loss
rs = abs(avg_gain / avg_loss)
rsi = 100 - (100 / (1 + rs))
df['RSI'] = rsi
df.reset_index(inplace=True)
df
Output:
Date High Low Open Close Volume Adj Close Gain Loss Avg Gain Avg Loss RSI
0 2018-07-03 3.87 3.76 3.83 3.84 684899302.0 3.629538 NaN NaN NaN NaN NaN
1 2018-07-04 3.91 3.84 3.86 3.86 460325574.0 3.648442 0.02 0.00 NaN NaN NaN
2 2018-07-05 3.70 3.62 3.68 3.68 292810499.0 3.680000 0.00 -0.18 NaN NaN NaN
3 2018-07-06 3.72 3.61 3.69 3.67 343653088.0 3.670000 0.00 -0.01 NaN NaN NaN
4 2018-07-09 3.75 3.68 3.70 3.69 424596186.0 3.690000 0.02 0.00 NaN NaN NaN
5 2018-07-10 3.74 3.70 3.71 3.71 327048051.0 3.710000 0.02 0.00 NaN NaN NaN
6 2018-07-11 3.65 3.61 3.63 3.64 371355401.0 3.640000 0.00 -0.07 NaN NaN NaN
7 2018-07-12 3.69 3.63 3.66 3.66 309888328.0 3.660000 0.02 0.00 NaN NaN NaN
8 2018-07-13 3.69 3.62 3.69 3.63 261928758.0 3.630000 0.00 -0.03 NaN NaN NaN
9 2018-07-16 3.63 3.57 3.61 3.62 306970074.0 3.620000 0.00 -0.01 NaN NaN NaN
10 2018-07-17 3.62 3.56 3.62 3.58 310294921.0 3.580000 0.00 -0.04 NaN NaN NaN
11 2018-07-18 3.61 3.55 3.58 3.58 334592695.0 3.580000 0.00 0.00 NaN NaN NaN
12 2018-07-19 3.61 3.56 3.61 3.56 211984563.0 3.560000 0.00 -0.02 NaN NaN NaN
13 2018-07-20 3.64 3.52 3.57 3.61 347506394.0 3.610000 0.05 0.00 NaN NaN NaN
14 2018-07-23 3.65 3.57 3.59 3.62 313125328.0 3.620000 0.01 0.00 0.010594 -0.021042 33.487100
15 2018-07-24 3.71 3.60 3.60 3.68 367627204.0 3.680000 0.06 0.00 0.015854 -0.018802 45.745967
16 2018-07-25 3.73 3.68 3.72 3.69 270460990.0 3.690000 0.01 0.00 0.015252 -0.016868 47.483263
17 2018-07-26 3.73 3.66 3.72 3.69 234388072.0 3.690000 0.00 0.00 0.013731 -0.015186 47.483263
18 2018-07-27 3.70 3.66 3.68 3.69 190039532.0 3.690000 0.00 0.00 0.012399 -0.013713 47.483263
19 2018-07-30 3.72 3.67 3.68 3.70 163971848.0 3.700000 0.01 0.00 0.012172 -0.012417 49.502851
20 2018-07-31 3.70 3.66 3.67 3.68 168486023.0 3.680000 0.00 -0.02 0.011047 -0.013118 45.716244
21 2018-08-01 3.72 3.66 3.71 3.68 199801191.0 3.680000 0.00 0.00 0.010047 -0.011930 45.716244
22 2018-08-02 3.68 3.59 3.66 3.61 307920738.0 3.610000 0.00 -0.07 0.009155 -0.017088 34.884632
23 2018-08-03 3.62 3.57 3.59 3.61 184816985.0 3.610000 0.00 0.00 0.008356 -0.015596 34.884632
24 2018-08-06 3.66 3.60 3.62 3.61 189696153.0 3.610000 0.00 0.00 0.007637 -0.014256 34.884632
25 2018-08-07 3.66 3.61 3.63 3.65 216157642.0 3.650000 0.04 0.00 0.010379 -0.013048 44.302922
26 2018-08-08 3.66 3.61 3.65 3.63 215365540.0 3.630000 0.00 -0.02 0.009511 -0.013629 41.101805
27 2018-08-09 3.66 3.59 3.59 3.65 230275455.0 3.650000 0.02 0.00 0.010378 -0.012504 45.353992
28 2018-08-10 3.66 3.60 3.65 3.62 219157328.0 3.620000 0.00 -0.03 0.009530 -0.013933 40.617049
29 2018-08-13 3.59 3.54 3.58 3.56 270620120.0 3.560000 0.00 -0.06 0.008759 -0.017658 33.158019
In this case, I want to create a new column 'max close within 14 trade days':
'max close within 14 trade days' = maximum 'Close' within the next 14 rows.
For example, in row 0 the data range is rows 1 through 14, so
'max close within 14 trade days' = 3.86
You can do the following:
# convert to date time
df['Date'] = pd.to_datetime(df['Date'])
# calculate max for 14 days
df['max_close_within_14_trade_days'] = df['Date'].map(df.groupby([pd.Grouper(key='Date', freq='14D')])['Close'].max())
# fill missing values by previous value
df['max_close_within_14_trade_days'].fillna(method='ffill', inplace=True)
   Date        High  Low   Open  Close  max_close_within_14_trade_days
0  2018-07-03  3.87  3.76  3.83  3.84   3.86
1  2018-07-04  3.91  3.84  3.86  3.86   3.86
2  2018-07-05  3.70  3.62  3.68  3.68   3.86
3  2018-07-06  3.72  3.61  3.69  3.67   3.86
4  2018-07-09  3.75  3.68  3.70  3.69   3.86
Another solution:
df['max_close_within_14_trade_days'] = [df.loc[x+1:x+14,'Close'].max() for x in range(0, df.shape[0])]
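If you want a vectorized version of that same next-14-rows maximum, here is a sketch using pandas' forward-looking window indexer (available since pandas 1.1):
from pandas.api.indexers import FixedForwardWindowIndexer
# max over rows x+1 .. x+14: shift Close up one row, then take a 14-row forward window
indexer = FixedForwardWindowIndexer(window_size=14)
df['max_close_within_14_trade_days'] = (
    df['Close'].shift(-1).rolling(indexer, min_periods=1).max()
)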

Split data frame based on specific value in Python

I have a dataframe as below:
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
10000 .90 1.10 1.30 1.50 2.10 3.10 5.60 8.40 15.80
15000 1.35 1.65 1.95 2.25 3.15 4.65 8.40 12.60 23.70
20000 1.80 2.20 2.60 3.00 4.20 6.20 11.20 16.80 31.60
25000 2.25 2.75 3.25 3.75 5.25 7.75 14.00 21.00 39.50
30000 2.70 3.30 3.90 4.50 6.30 9.30 16.80 25.20 47.40
35000 3.15 3.85 4.55 5.25 7.35 10.85 19.60 29.40 55.30
40000 3.60 4.40 5.20 6.00 8.40 12.40 22.40 33.60 63.20
45000 4.05 4.95 5.85 6.75 9.45 13.95 25.20 37.80 71.10
50000 4.50 5.50 6.50 7.50 10.50 15.50 28.00 42.00 79.00
10000 .60 .80 1.00 1.20 1.80 2.80 5.30 8.10 15.50
15000 .90 1.20 1.50 1.80 2.70 4.20 7.95 12.15 23.25
20000 1.20 1.60 2.00 2.40 3.60 5.60 10.60 16.20 31.00
25000 1.50 2.00 2.50 3.00 4.50 7.00 13.25 20.25 38.75
30000 1.80 2.40 3.00 3.60 5.40 8.40 15.90 24.30 46.50
35000 2.10 2.80 3.50 4.20 6.30 9.80 18.55 28.35 54.25
40000 2.40 3.20 4.00 4.80 7.20 11.20 21.20 32.40 62.00
45000 2.70 3.60 4.50 5.40 8.10 12.60 23.85 36.45 69.75
50000 3.00 4.00 5.00 6.00 9.00 14.00 26.50 40.50 77.50
1000 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20
2000 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39
3000 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59
4000 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78
5000 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98
6000 1.17 1.17 1.17 1.17 1.17 1.17 1.17 1.17 1.17
7000 1.37 1.37 1.37 1.37 1.37 1.37 1.37 1.37 1.37
8000 1.56 1.56 1.56 1.56 1.56 1.56 1.56 1.56 1.56
9000 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76
10000 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95
Now I would like to split it into 3 dataframes based on 'Size':
df1: from the first 10000 up to (but not including) the next occurrence of 10000
df2: from the second 10000 up to the 1000
df3: from 1000 to the end
Alternatively, it is fine to have a temporary column in the same dataframe specifying categories like S1, S2 and S3 for the above ranges.
Could anyone guide me on how to go about this?
Regards
Assuming that you want to break on the decreases, you could use the compare-cumsum-groupby pattern:
parts = list(df.groupby((df["Size"].diff() < 0).cumsum()))
which gives me (suppressing boring rows in the middle)
>>> for key, group in parts:
... print(key)
... print(group)
... print("----")
...
0
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
0 10000 0.90 1.10 1.30 1.50 2.10 3.10 5.6 8.4 15.8
1 15000 1.35 1.65 1.95 2.25 3.15 4.65 8.4 12.6 23.7
2 20000 1.80 2.20 2.60 3.00 4.20 6.20 11.2 16.8 31.6
[...]
7 45000 4.05 4.95 5.85 6.75 9.45 13.95 25.2 37.8 71.1
8 50000 4.50 5.50 6.50 7.50 10.50 15.50 28.0 42.0 79.0
----
1
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
9 10000 0.6 0.8 1.0 1.2 1.8 2.8 5.30 8.10 15.50
10 15000 0.9 1.2 1.5 1.8 2.7 4.2 7.95 12.15 23.25
11 20000 1.2 1.6 2.0 2.4 3.6 5.6 10.60 16.20 31.00
[...]
16 45000 2.7 3.6 4.5 5.4 8.1 12.6 23.85 36.45 69.75
17 50000 3.0 4.0 5.0 6.0 9.0 14.0 26.50 40.50 77.50
----
2
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
18 1000 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20
19 2000 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39
20 3000 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59
[...]
26 9000 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76
27 10000 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95
----
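If you instead want the temporary category column mentioned in the question (S1, S2, S3), here is a minimal sketch built on the same compare-cumsum key (the column name 'group' is just an example):
# label each monotonic run of Size as S1, S2, S3, ...
key = (df['Size'].diff() < 0).cumsum()
df['group'] = 'S' + (key + 1).astype(str)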
Not so elegant, but this works:
In [259]:
ranges = []
first = df.index[0]
criteria = df.index[df['Size'].diff() < 0]
for idx in criteria:
    ranges.append((first, idx))
    first = idx  # start the next range at this break point
ranges
Out[259]:
[(0, 9), (9, 18)]
In [261]:
splits = []
for r in ranges:
    splits.append(df.iloc[r[0]:r[1]])
# the tail runs from the last break point to the end
splits.append(df.iloc[ranges[-1][1]:])
splits
splits
Out[261]:
[ Size C1 C2 C3 C4 C5 C6 C7 C8 C9
0 10000 0.90 1.10 1.30 1.50 2.10 3.10 5.6 8.4 15.8
1 15000 1.35 1.65 1.95 2.25 3.15 4.65 8.4 12.6 23.7
2 20000 1.80 2.20 2.60 3.00 4.20 6.20 11.2 16.8 31.6
3 25000 2.25 2.75 3.25 3.75 5.25 7.75 14.0 21.0 39.5
4 30000 2.70 3.30 3.90 4.50 6.30 9.30 16.8 25.2 47.4
5 35000 3.15 3.85 4.55 5.25 7.35 10.85 19.6 29.4 55.3
6 40000 3.60 4.40 5.20 6.00 8.40 12.40 22.4 33.6 63.2
7 45000 4.05 4.95 5.85 6.75 9.45 13.95 25.2 37.8 71.1
8 50000 4.50 5.50 6.50 7.50 10.50 15.50 28.0 42.0 79.0,
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
9 10000 0.6 0.8 1.0 1.2 1.8 2.8 5.30 8.10 15.50
10 15000 0.9 1.2 1.5 1.8 2.7 4.2 7.95 12.15 23.25
11 20000 1.2 1.6 2.0 2.4 3.6 5.6 10.60 16.20 31.00
12 25000 1.5 2.0 2.5 3.0 4.5 7.0 13.25 20.25 38.75
13 30000 1.8 2.4 3.0 3.6 5.4 8.4 15.90 24.30 46.50
14 35000 2.1 2.8 3.5 4.2 6.3 9.8 18.55 28.35 54.25
15 40000 2.4 3.2 4.0 4.8 7.2 11.2 21.20 32.40 62.00
16 45000 2.7 3.6 4.5 5.4 8.1 12.6 23.85 36.45 69.75
17 50000 3.0 4.0 5.0 6.0 9.0 14.0 26.50 40.50 77.50,
Size C1 C2 C3 C4 C5 C6 C7 C8 C9
18 1000 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20 0.20
19 2000 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39 0.39
20 3000 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59 0.59
21 4000 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78 0.78
22 5000 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98 0.98
23 6000 1.17 1.17 1.17 1.17 1.17 1.17 1.17 1.17 1.17
24 7000 1.37 1.37 1.37 1.37 1.37 1.37 1.37 1.37 1.37
25 8000 1.56 1.56 1.56 1.56 1.56 1.56 1.56 1.56 1.56
26 9000 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76 1.76
27 10000 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95 1.95]
So first this looks for the points where Size stops increasing:
df['Size'].diff() < 0
We use that to mask the index, iterate over the resulting break points to build a list of (start, stop) tuples, and then slice the df over those ranges in the last step.
