I'm currently trying to calculate the simple moving average on a dataset of several stocks. I'm testing the code on just two companies (and a 4-day window) for simplicity to get it working, but there seems to be a problem with the output. Below is my code.
for index, row in df3.iloc[4:].iterrows():
    if df3.loc[index,'CompanyId'] == df3.loc[index-4,'CompanyId']:
        df3['SMA4'] = df3.iloc[:,1].rolling(window=4).mean()
    else:
        df3['SMA4'] = 0
And here is the output: [image: Output]
The dataframe is sorted by date and company id. What needs to happen is that when the company ids are not equal, as tested in the code, the output should be zero, since I can't calculate a moving average across two different companies. Instead, it outputs a moving average spanning both companies, as in rows 7, 8 and 9.
Use groupby.rolling
df['SMA4'] = df.groupby('CompanyId', sort=False)['Price'].rolling(window=4).mean().reset_index(level=0, drop=True)
print(df)
CompanyId Price SMA4
0 1 75 NaN
1 1 74 NaN
2 1 77 NaN
3 1 78 76.00
4 1 80 77.25
5 1 79 78.50
6 1 80 79.25
7 0 10 NaN
8 0 9 NaN
9 0 12 NaN
10 0 11 10.50
11 0 11 10.75
12 0 8 10.50
13 0 9 9.75
14 0 8 9.00
15 0 8 8.25
16 0 11 9.00
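For anyone who wants to run this end to end, here is a minimal sketch that rebuilds the data from the output above and uses groupby + transform instead of groupby.rolling. This is a swapped-in variant, not the code from the answer: transform returns a result aligned to the original index, so no reset_index is needed.

```python
import pandas as pd

# Data rebuilt from the printed output above.
df = pd.DataFrame({
    "CompanyId": [1] * 7 + [0] * 10,
    "Price": [75, 74, 77, 78, 80, 79, 80, 10, 9, 12, 11, 11, 8, 9, 8, 8, 11],
})

# Rolling mean computed separately inside each company; the first three
# rows of every company stay NaN because the 4-row window is not full yet.
df["SMA4"] = (
    df.groupby("CompanyId", sort=False)["Price"]
      .transform(lambda s: s.rolling(window=4).mean())
)
print(df)
```

Because the result keeps the original index, this should also work when the rows of different companies are interleaved.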
While ansev is right that you should use the specialized function because manual loops are much slower, I want to show why your code didn't work:
In both the if branch and the else branch, the assignment to df3['SMA4'] overwrites the entire SMA4 column, not just the current row. Because the if condition is true on the last pass through the loop, the final assignment is the rolling mean over the whole column, so the else branch has no lasting effect and SMA4 is never 0. To fix that, you could first create the column populated with rolling averages (note that this is not in a for loop):
df3['SMA4'] = df3.iloc[:,1].rolling(window=4).mean()
And then you run the loop to set invalid rows to 0 (though NaN would be better; I kept the other bugs in, assuming that the numbers in ansev's answer are correct):
for index, row in df3.iloc[4:].iterrows():
    if df3.loc[index,'CompanyId'] != df3.loc[index-4,'CompanyId']:
        df3.loc[index,'SMA4'] = 0
Output (probably still buggy):
CompanyId Price SMA4
0 1 75 NaN
1 1 74 NaN
2 1 77 NaN
3 1 78 76.00
4 1 80 77.25
5 1 79 78.50
6 1 80 79.25
7 2 10 0.00
8 2 9 0.00
9 2 12 0.00
10 2 11 0.00
11 2 11 10.75
12 2 8 10.50
13 2 9 9.75
14 2 8 9.00
15 2 8 8.25
16 2 11 9.00
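One of the remaining bugs is visible at row 10: a 4-row window ending at row i covers rows i-3 to i, so the company check should look back 3 rows, not 4. A hedged sketch of the same "compute first, then invalidate" idea without a loop, assuming the frame is sorted by company and using NaN rather than 0 (the data is rebuilt from the output above):

```python
import pandas as pd

df3 = pd.DataFrame({
    "CompanyId": [1] * 7 + [2] * 10,
    "Price": [75, 74, 77, 78, 80, 79, 80, 10, 9, 12, 11, 11, 8, 9, 8, 8, 11],
})

window = 4

# Rolling mean over the whole column, ignoring company boundaries.
sma = df3["Price"].rolling(window=window).mean()

# A window ending at row i starts at row i - (window - 1); it is valid only
# if that first row belongs to the same company (assumes sorted by company).
same_company = df3["CompanyId"].eq(df3["CompanyId"].shift(window - 1))

# Keep the average where the window is valid, NaN elsewhere.
df3["SMA4"] = sma.where(same_company)
print(df3)
```

With this check, rows 7 to 9 are NaN and row 10 gets 10.50, which agrees with the groupby result.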
Related
Let's say I have the following sample dataframe:
import random
import pandas as pd

df = pd.DataFrame({'depth': list(range(0, 21)),
                   'time': list(range(0, 21)),
                   'metric': random.choices(range(10), k=21)})
df
Out[65]:
depth time metric
0 0 0 2
1 1 1 3
2 2 2 8
3 3 3 0
4 4 4 8
5 5 5 9
6 6 6 5
7 7 7 1
8 8 8 6
9 9 9 6
10 10 10 7
11 11 11 2
12 12 12 7
13 13 13 0
14 14 14 6
15 15 15 0
16 16 16 5
17 17 17 6
18 18 18 9
19 19 19 6
20 20 20 8
I want to average every ten rows of the "metric" column (preserving the first row as is) and pull the tenth item from the depth and time columns. For example:
depth time metric
0 0 0 2
10 10 10 5.3
20 20 20 4.9
I know that groupby is usually used in these situations, but I do not know how to tweak it to get my desired outcome:
df[['metric']].groupby(df.index //10).mean()
Out[66]:
metric
0 4.8
1 4.8
2 8.0
@BENY's answer is on the right track but not quite right. It should be:
df.groupby((df.index+9)//10).agg({'depth':'last','time':'last','metric':'mean'})
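To see why the +9 offset gives the right groups, here is a small self-contained sketch using the metric values from the question (it assumes a default 0..n-1 integer index): row 0 falls in group 0 by itself, rows 1 to 10 in group 1, and rows 11 to 20 in group 2.

```python
import pandas as pd

# Metric values copied from the sample frame in the question.
metric = [2, 3, 8, 0, 8, 9, 5, 1, 6, 6, 7, 2, 7, 0, 6, 0, 5, 6, 9, 6, 8]
df = pd.DataFrame({"depth": range(21), "time": range(21), "metric": metric})

# (0 + 9) // 10 == 0, (1..10 + 9) // 10 == 1, (11..20 + 9) // 10 == 2
key = (df.index + 9) // 10

out = df.groupby(key).agg({"depth": "last", "time": "last", "metric": "mean"})
print(out)
```

The result is indexed by group number (0, 1, 2) rather than by the original row labels; depth and time carry the values 0, 10 and 20.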
You can use rolling with reindex + fillna
df.rolling(10).mean().reindex(df.index[::10]).fillna(df)
depth time metric
0 0.0 0.0 2.0
10 5.5 5.5 5.3
20 15.5 15.5 4.9
Or to match output for depth and time:
out = (df.assign(metric=df['metric'].rolling(10).mean()
                                    .reindex(df.index[::10]).fillna(df['metric']))
         .dropna(subset=['metric']))
print(out)
depth time metric
0 0 0 2.0
10 10 10 5.3
20 20 20 4.9
Let us do agg
g = df.index.isin(df.index[::10]).cumsum()[::-1]
df.groupby(g).agg({'depth':'last','time':'last','metric':'mean'})
Out[263]:
depth time metric
1 20 20 4.9
2 10 10 5.3
3 0 0 2.0
I am trying to create a new column that will list down the last recorded peak values, until the next peak comes along. For example, suppose this is my existing DataFrame:
index values
0 10
1 20
2 15
3 17
4 15
5 22
6 20
I want to get something like this:
index values last_recorded_peak
0 10 10
1 20 20
2 15 20
3 17 17
4 15 17
5 22 22
6 20 22
So far, I have tried with np.maximum.accumulate, which 'accumulates' the max value but not quite the "peaks" (some peaks might be lower than the max value).
I have also tried with scipy.signal.find_peaks which returns an array of indexes where my peaks are (in the example, index 1, 3, 5), which is not what I'm looking for.
I'm relatively new to coding, any pointer is very much appreciated!
You're on the right track. scipy.signal.find_peaks is the way I would go; you just need to do a little more work on its result:
from scipy import signal
peaks = signal.find_peaks(df['values'])[0]
df['last_recorded_peak'] = (df.assign(last_recorded_peak=float('nan'))
                              .last_recorded_peak
                              .combine_first(df.loc[peaks,'values'])
                              .ffill()
                              .combine_first(df['values']))
print(df)
index values last_recorded_peak
0 0 10 10.0
1 1 20 20.0
2 2 15 20.0
3 3 17 17.0
4 4 15 17.0
5 5 22 22.0
6 6 20 22.0
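If scipy is not available, the same "mark the peaks, then forward-fill" idea can be sketched in plain pandas. Note this swaps find_peaks for a simple strict local-maximum test (greater than both neighbours), which is an assumption of this sketch and may treat plateaus and endpoints differently from find_peaks:

```python
import pandas as pd

df = pd.DataFrame({"values": [10, 20, 15, 17, 15, 22, 20]})
v = df["values"]

# A row is a peak if it is strictly greater than both of its neighbours.
# The first and last rows compare against NaN and are never peaks.
is_peak = (v > v.shift(1)) & (v > v.shift(-1))

# Keep only peak values, carry each one forward until the next peak,
# and fall back to the raw value before the first peak.
df["last_recorded_peak"] = v.where(is_peak).ffill().fillna(v)
print(df)
```

For the sample data the peaks land at index 1, 3 and 5, the same positions the question reports from find_peaks.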
If I understand you correctly, you are looking for a rolling max:
note: you might have to play around with the window size, which I set to 2 for your example dataframe
df['last_recorded_peak'] = df['values'].rolling(2).max().fillna(df['values'])
Output
values last_recorded_peak
0 10 10.0
1 20 20.0
2 15 20.0
3 17 17.0
4 15 17.0
5 22 22.0
6 20 22.0
I'm writing a Python script to recalculate the inventory of a specific SKU from today back over the past 365 days, given the current stock. For that I'm using a pandas DataFrame, as shown below:
Index DATE SUM_IN SUM_OUT
0 5/12/18 500 0
1 5/13/18 0 -403
2 5/14/18 0 -58
3 5/15/18 0 -39
4 5/16/18 100 0
5 5/17/18 0 -98
6 5/18/18 276 0
7 5/19/18 0 -139
8 5/20/18 0 -59
9 5/21/18 0 -70
The dataframe presents the sum of quantities IN and OUT of the warehouse, grouped by date. My intention is to add a column named "STOCK" that gives the stock level of the SKU on each day. What I have is the current stock level (index 9). So what I need is to recalculate the levels day by day, backwards through the whole date series (from index 9 to index 0).
In Excel it's easy. I can put the current level in the last row and just extend the calculation until I reach the row of index 0. As presented (Column E is the formula, Column G is the desired Output):
Can someone help me achieve this result?
I already have the stock level of the last day (i.e. the stock on 5/21/2018 is 10). What I need is to place the number 10 at index 9 and calculate the stock levels of the earlier days, from index 8 back to 0.
The desired output should be:
Index DATE TRANSACTION_IN TRANSACTION_OUT SUM_IN SUM_OUT STOCK
0 5/12/18 1 0 500 0 500
1 5/13/18 0 90 0 -403 97
2 5/14/18 0 11 0 -58 39
3 5/15/18 0 11 0 -39 0
4 5/16/18 1 0 100 0 100
5 5/17/18 0 17 0 -98 2
6 5/18/18 1 0 276 0 278
7 5/19/18 0 12 0 -139 139
8 5/20/18 0 4 0 -59 80
9 5/21/18 0 7 0 -70 10
(Updated)
last_stock = 10 # You should try another value
a = (df.SUM_IN + df.SUM_OUT).cumsum()
df["STOCK"] = a - (a.iloc[-1] - last_stock)
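Here is a self-contained sketch of the same idea with the data from the question. The function name backfill_stock is made up for illustration. The running total of net movements gives the shape of the stock curve; subtracting a constant offset pins its last value to the known stock level.

```python
import pandas as pd

df = pd.DataFrame({
    "DATE": ["5/12/18", "5/13/18", "5/14/18", "5/15/18", "5/16/18",
             "5/17/18", "5/18/18", "5/19/18", "5/20/18", "5/21/18"],
    "SUM_IN": [500, 0, 0, 0, 100, 0, 276, 0, 0, 0],
    "SUM_OUT": [0, -403, -58, -39, 0, -98, 0, -139, -59, -70],
})


def backfill_stock(frame, last_stock):
    """Hypothetical helper: stock per day, given the stock on the last row."""
    # Running total of daily net movement (SUM_OUT is already negative).
    running = (frame["SUM_IN"] + frame["SUM_OUT"]).cumsum()
    # Shift the whole curve so that its last point equals last_stock.
    return running - (running.iloc[-1] - last_stock)


df["STOCK"] = backfill_stock(df, last_stock=10)
print(df)
```

With last_stock=10 the offset happens to be zero for this data; any other value simply shifts every day by the same amount.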
Use cumsum to create the key for groupby, then use cumsum again within each group:
df['SUM_IN'].replace(0,np.nan).ffill()+df.groupby(df['SUM_IN'].gt(0).cumsum()).SUM_OUT.cumsum()
Out[292]:
0 500.0
1 97.0
2 39.0
3 0.0
4 100.0
5 2.0
6 276.0
7 137.0
8 78.0
9 8.0
dtype: float64
Update
s=df['SUM_IN'].replace(0,np.nan).ffill()+df.groupby(df['SUM_IN'].gt(0).cumsum()).SUM_OUT.cumsum()-df.STOCK
df['SUM_IN'].replace(0,np.nan).ffill()+df.groupby(df['SUM_IN'].gt(0).cumsum()).SUM_OUT.cumsum()-s.groupby(df['SUM_IN'].gt(0).cumsum()).bfill().fillna(0)
Out[318]:
0 500.0
1 97.0
2 39.0
3 0.0
4 100.0
5 2.0
6 278.0
7 139.0
8 80.0
9 10.0
dtype: float64
I have a dataframe of the following type
df = pd.DataFrame({'Days':[1,2,5,6,7,10,11,12],
'Value':[100.3,150.5,237.0,314.15,188.0,413.0,158.2,268.0]})
Days Value
0 1 100.3
1 2 150.5
2 5 237.0
3 6 314.15
4 7 188.0
5 10 413.0
6 11 158.2
7 12 268.0
and I would like to add a column '+5Ratio' whose data is the ratio between the Value corresponding to Days+5 and the Value corresponding to Days.
For example in first row I would have 3.13210368893 = 314.15/100.3, in the second I would have 1.24916943522 = 188.0/150.5 and so on.
Days Value +5Ratio
0 1 100.3 3.13210368893
1 2 150.5 1.24916943522
2 5 237.0 ...
3 6 314.15
4 7 188.0
5 10 413.0
6 11 158.2
7 12 268.0
I'm struggling to find a way to do it using a lambda function.
Could someone help me find a way to solve this problem?
Thanks in advance.
Edit
In the case I am interested in, the "Days" field can vary sparsely from 1 to 18180, for instance.
You can use merge; the benefit of doing it this way is that it handles missing days:
s=df.merge(df.assign(Days=df.Days-5),on='Days')
s.assign(Value=s.Value_y/s.Value_x).drop(['Value_x','Value_y'],axis=1)
Out[359]:
Days Value
0 1 3.132104
1 2 1.249169
2 5 1.742616
3 6 0.503581
4 7 1.425532
Consider left merging on a helper dataframe, days, for consecutive daily points and then shift by 5 rows for ratio calculation. Finally remove the blank day rows:
days_df = pd.DataFrame({'Days':range(min(df.Days), max(df.Days)+1)})
days_df = days_df.merge(df, on='Days', how='left')
print(days_df)
# Days Value
# 0 1 100.30
# 1 2 150.50
# 2 3 NaN
# 3 4 NaN
# 4 5 237.00
# 5 6 314.15
# 6 7 188.00
# 7 8 NaN
# 8 9 NaN
# 9 10 413.00
# 10 11 158.20
# 11 12 268.00
days_df['+5ratio'] = days_df.shift(-5)['Value'] / days_df['Value']
final_df = days_df[days_df['Value'].notnull()].reset_index(drop=True)
print(final_df)
# Days Value +5ratio
# 0 1 100.30 3.132104
# 1 2 150.50 1.249169
# 2 5 237.00 1.742616
# 3 6 314.15 0.503581
# 4 7 188.00 1.425532
# 5 10 413.00 NaN
# 6 11 158.20 NaN
# 7 12 268.00 NaN
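For very sparse Days (say up to 18180), a lookup by label avoids building the dense helper frame. This is a different technique from the two answers above, sketched under the assumption that each Days value appears only once: index Value by Days, then map Days + 5 through that lookup.

```python
import pandas as pd

df = pd.DataFrame({"Days": [1, 2, 5, 6, 7, 10, 11, 12],
                   "Value": [100.3, 150.5, 237.0, 314.15, 188.0, 413.0, 158.2, 268.0]})

# Series of Value indexed by Days (assumes Days has no duplicates).
lookup = df.set_index("Days")["Value"]

# Series.map with a Series argument looks each key up in its index;
# days with no entry five days later come back as NaN.
df["+5Ratio"] = (df["Days"] + 5).map(lookup) / df["Value"]
print(df)
```

Unlike the inner merge, this keeps every original row, with NaN where Days+5 is missing.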
I have a pandas dataframe created from measured numbers. When something goes wrong with the measurement, the last value is repeated. I would like to do two things:
1. Change all repeating values either to nan or 0.
2. Keep the first repeating value and change all other values nan or 0.
I have found solutions using "shift" but they drop repeating values. I do not want to drop repeating values. My data frame looks like this:
df = pd.DataFrame(np.random.randn(15, 3))
df.iloc[4:8,0]=40
df.iloc[12:15,1]=22
df.iloc[10:12,2]=0.23
giving a dataframe like this:
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 40.000000 0.283084 -1.495734
5 40.000000 -0.074763 -0.840403
6 40.000000 0.709794 -1.000048
7 40.000000 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 0.230000
11 1.187208 0.964340 0.230000
12 0.116258 22.000000 1.119744
13 -0.501180 22.000000 0.558941
14 0.551586 22.000000 -0.993749
What I would like to be able to do is write some code that filters the data and gives me a data frame like this:
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 NaN 0.283084 -1.495734
5 NaN -0.074763 -0.840403
6 NaN 0.709794 -1.000048
7 NaN 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 NaN
11 1.187208 0.964340 NaN
12 0.116258 NaN 1.119744
13 -0.501180 NaN 0.558941
14 0.551586 NaN -0.993749
Or, even better, keep the first value and change the rest to NaN, like this:
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 40.000000 0.283084 -1.495734
5 NaN -0.074763 -0.840403
6 NaN 0.709794 -1.000048
7 NaN 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 0.230000
11 1.187208 0.964340 NaN
12 0.116258 22.000000 1.119744
13 -0.501180 NaN 0.558941
14 0.551586 NaN -0.993749
Using shift & mask:
df.shift(1) == df compares the previous row to the current one to find consecutive duplicates.
df.mask(df.shift(1) == df)
# outputs
0 1 2
0 0.365329 0.153527 0.143244
1 0.688364 0.495755 1.065965
2 0.354180 -0.023518 3.338483
3 -0.106851 0.296802 -0.594785
4 40.000000 0.149378 1.507316
5 NaN -1.312952 0.225137
6 NaN -0.242527 -1.731890
7 NaN 0.798908 0.654434
8 2.226980 -1.117809 -1.172430
9 -1.228234 -3.129854 -1.101965
10 0.393293 1.682098 0.230000
11 -0.029907 -0.502333 NaN
12 0.107994 22.000000 0.354902
13 -0.478481 NaN 0.531017
14 -1.517769 NaN 1.552974
If you want to remove all the consecutive duplicates (including the first of each run), test whether either the previous row or the next row is the same as the current row:
df.mask((df.shift(1) == df) | (df.shift(-1) == df))
Option 1
Specialized solution using diff. Gets the final desired output (keep the first value of each run).
df.mask(df.diff().eq(0))
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 40.000000 0.283084 -1.495734
5 NaN -0.074763 -0.840403
6 NaN 0.709794 -1.000048
7 NaN 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 0.230000
11 1.187208 0.964340 NaN
12 0.116258 22.000000 1.119744
13 -0.501180 NaN 0.558941
14 0.551586 NaN -0.993749
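Both variants from the question can be checked on a small deterministic frame. The sketch below uses made-up values (column names "a" and "b" are hypothetical) so that the result is reproducible, unlike the random data above:

```python
import pandas as pd

# Made-up data: "a" has a run of three 40s, "b" has two runs of two.
df = pd.DataFrame({
    "a": [1.0, 40.0, 40.0, 40.0, 2.0, 3.0],
    "b": [0.5, 0.7, 0.23, 0.23, 0.9, 0.9],
})

same_as_prev = df.eq(df.shift(1))    # current value equals the row above
same_as_next = df.eq(df.shift(-1))   # current value equals the row below

# Variant 2 from the question: keep the first value of each run.
keep_first = df.mask(same_as_prev)

# Variant 1 from the question: blank out every value that is part of a run.
drop_all = df.mask(same_as_prev | same_as_next)

print(keep_first)
print(drop_all)
```

The NaN values introduced by shift at the edges never compare equal, so the first and last rows are only masked when they really repeat a neighbour.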