I have a pandas df that looks like this for example:
%diff
-0.087704164
0.003908466
-0.032150706
-0.035684163
0.001682029
-0.072205803
0.031636864
-0.069263158
-0.214883511
-0.109286469
0.274932615
-0.016913319
-0.075268817
0.191906977
0.043861703
-0.048598131
0.01280943
0.014509621
0.075564054
-0.024034701
0.009107468
0.023465704
I want to calculate the square root of 252 multiplied by the standard deviation of the last 20 values in the column '%diff', like this:
%diff std
-0.087704164
0.003908466
-0.032150706
-0.035684163
0.001682029
-0.072205803
0.031636864
-0.069263158
-0.214883511
-0.109286469
0.274932615
-0.016913319
-0.075268817
0.191906977
0.043861703
-0.048598131
0.01280943
0.014509621
0.075564054
-0.024034701 165.9%
0.009107468 163.2%
0.023465704 163.4%
The code I tried is:
df1['std'] = 252**(1.0/2) * df1['%diff'].std().split(20)
But I get an unsupported operand error
df1['%diff'].std() returns a single scalar for the whole column, so there is nothing to take the last 20 values from. You need rolling with a window of 20 followed by std:
df1['std'] = 252**(1.0/2) * df1.rolling(20)['%diff'].std()
print (df1)
%diff std
0 -0.087704 NaN
1 0.003908 NaN
2 -0.032151 NaN
3 -0.035684 NaN
4 0.001682 NaN
5 -0.072206 NaN
6 0.031637 NaN
7 -0.069263 NaN
8 -0.214884 NaN
9 -0.109286 NaN
10 0.274933 NaN
11 -0.016913 NaN
12 -0.075269 NaN
13 0.191907 NaN
14 0.043862 NaN
15 -0.048598 NaN
16 0.012809 NaN
17 0.014510 NaN
18 0.075564 NaN
19 -0.024035 1.659144
20 0.009107 1.631865
21 0.023466 1.634266
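The numbers in std are annualized volatilities as decimals (1.659144 ≈ 165.9%). If you also want them rendered as percentage strings like in your desired output, a small sketch; the std_pct column name is just an illustration, and the formatting turns the values into strings:

import pandas as pd

ann_vol = 252 ** 0.5 * df1['%diff'].rolling(20).std()   # same numbers as the 'std' column above
df1['std_pct'] = ann_vol.map(lambda v: f"{v:.1%}" if pd.notna(v) else "")   # e.g. '165.9%'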
I am trying to add rows to a data frame that should follow a numeric order from 1 to 52, but my data is missing some numbers, so I need to add those rows and fill the spots with NaN (null) values.
df = pd.DataFrame({"Weeks": [1, 2, 3, 15, 16, 20, 21, 52],
                   "Values": [10, 10, 10, 10, 50, 60, 70, 40]})
Desired output:
Weeks Values
1 10
2 10
3 10
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
...
52 40
and so on until it reaches Weeks = 52.
My solution:
new_df = pd.DataFrame({"Weeks": [], "Values": []})
for x in range(1, 53):
    for i in df.Weeks:
        if x == i:
            new_df["Weeks"] = x
            new_df["Values"] = df.Values[i]
The problem is that this is super inefficient. Does anyone know a more efficient way to do it?
You could use set_index to set Weeks as the index and reindex with a range up to the maximum week:
df.set_index('Weeks').reindex(range(1, df.Weeks.max() + 1))
Or accounting for the minimum week too:
df.set_index('Weeks').reindex(range(df.Weeks.min(), df.Weeks.max() + 1))
Values
Weeks
1 10.0
2 10.0
3 10.0
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
10 NaN
11 NaN
12 NaN
13 NaN
14 NaN
15 10.0
16 50.0
17 NaN
...
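If you want Weeks back as a regular column instead of as the index (as in the original frame), you can chain reset_index; a small sketch, hard-coding the 1 to 52 range from the question:

out = (df.set_index('Weeks')
         .reindex(range(1, 53))   # weeks 1 through 52 inclusive
         .reset_index())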
I have a pandas dataframe created from measured numbers. When something goes wrong with the measurement, the last value is repeated. I would like to do two things:
1. Change all repeating values either to NaN or 0.
2. Keep the first repeating value and change all other values to NaN or 0.
I have found solutions using "shift", but they drop the repeating values, and I do not want to drop them. My data frame looks like this:
df = pd.DataFrame(np.random.randn(15, 3))
df.iloc[4:8,0]=40
df.iloc[12:15,1]=22
df.iloc[10:12,2]=0.23
giving a dataframe like this:
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 40.000000 0.283084 -1.495734
5 40.000000 -0.074763 -0.840403
6 40.000000 0.709794 -1.000048
7 40.000000 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 0.230000
11 1.187208 0.964340 0.230000
12 0.116258 22.000000 1.119744
13 -0.501180 22.000000 0.558941
14 0.551586 22.000000 -0.993749
What I would like to be able to do is write some code that filters the data and gives me a data frame like this:
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 NaN 0.283084 -1.495734
5 NaN -0.074763 -0.840403
6 NaN 0.709794 -1.000048
7 NaN 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 NaN
11 1.187208 0.964340 NaN
12 0.116258 NaN 1.119744
13 -0.501180 NaN 0.558941
14 0.551586 NaN -0.993749
or even better keep the first value and change the rest to NaN. Like this:
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 40.000000 0.283084 -1.495734
5 NaN -0.074763 -0.840403
6 NaN 0.709794 -1.000048
7 NaN 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 0.230000
11 1.187208 0.964340 NaN
12 0.116258 22.000000 1.119744
13 -0.501180 NaN 0.558941
14 0.551586 NaN -0.993749
Using shift & mask:
df.shift(1) == df compares each row to the previous one, flagging consecutive duplicates.
df.mask(df.shift(1) == df)
# outputs
0 1 2
0 0.365329 0.153527 0.143244
1 0.688364 0.495755 1.065965
2 0.354180 -0.023518 3.338483
3 -0.106851 0.296802 -0.594785
4 40.000000 0.149378 1.507316
5 NaN -1.312952 0.225137
6 NaN -0.242527 -1.731890
7 NaN 0.798908 0.654434
8 2.226980 -1.117809 -1.172430
9 -1.228234 -3.129854 -1.101965
10 0.393293 1.682098 0.230000
11 -0.029907 -0.502333 NaN
12 0.107994 22.000000 0.354902
13 -0.478481 NaN 0.531017
14 -1.517769 NaN 1.552974
If you want to mask all of the consecutive duplicates, including the first one in each run, also test whether the next row is the same as the current row:
df.mask((df.shift(1) == df) | (df.shift(-1) == df))
Option 1
Specialized solution using diff. Gets at the final desired output.
df.mask(df.diff().eq(0))
0 1 2
0 1.239916 1.109434 0.305490
1 0.248682 1.472628 0.630074
2 -0.028584 -1.116208 0.074299
3 -0.784692 -0.774261 -1.117499
4 40.000000 0.283084 -1.495734
5 NaN -0.074763 -0.840403
6 NaN 0.709794 -1.000048
7 NaN 0.920943 0.681230
8 -0.701831 0.547689 -0.128996
9 -0.455691 0.610016 0.420240
10 -0.856768 -1.039719 0.230000
11 1.187208 0.964340 NaN
12 0.116258 22.000000 1.119744
13 -0.501180 NaN 0.558941
14 0.551586 NaN -0.993749
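The question asks for NaN or 0; if you would rather have zeros, mask takes the replacement value as its second argument, so the same conditions work (a sketch):

# keep the first value of each run of repeats, replace the rest with 0
df.mask(df.diff().eq(0), 0)

# or replace every member of a run of repeats with 0
df.mask((df.shift(1) == df) | (df.shift(-1) == df), 0)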
I'm using python pandas for data analysis.
I have a data frame with raw data and a data frame with the faulty data, in which the positions that were correct are filled with NaN values. I want to create a new data frame where the faulty data is taken away from the raw data and NaN values are filled in its place.
Raw Data
NE NW S
timestamp
0 15 12 13
1 15 19 13
2 15 12 13
3 12 18 11
Faulty data
NE NW S
timestamp
0 NaN NaN NaN
1 15 19 NaN
2 NaN NaN NaN
3 12 18 NaN
I want to get the following data frame:
Correct data
NE NW S
timestamp
0 15 12 13
1 NaN NaN 13
2 15 12 13
3 NaN NaN 11
How do I do this with pandas?
Use isnull on the faulty df to mask your raw df
In [10]:
raw[faulty.isnull()]
Out[10]:
NE NW S
timestamp
0 15 12 13
1 NaN NaN 13
2 15 12 13
3 NaN NaN 11
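Equivalently, you can phrase this with where (keep raw where the condition holds, NaN elsewhere) or mask; a sketch assuming raw and faulty share the same index and columns:

correct = raw.where(faulty.isnull())    # keep raw where faulty is NaN
# or the inverse phrasing:
correct = raw.mask(faulty.notnull())    # blank out raw where faulty has a value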
I posted this a while ago but no one could solve the problem.
First, let's create some correlated DataFrames and call rolling_corr(), with dropna() since I am going to sparse the data up later, and with no min_periods set since I want to keep the results consistent with the chosen window:
import numpy as np
from pandas import DataFrame, rolling_corr   # rolling_corr: legacy (pre-0.18) pandas API

hey = (DataFrame(np.random.random((15, 3))) + .2).cumsum()
hoo = (DataFrame(np.random.random((15, 3))) + .2).cumsum()
hey_corr = rolling_corr(hey.dropna(), hoo.dropna(), 4)
gives me
In [388]: hey_corr
Out[388]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 0.991087 0.978383 0.992614
4 0.974117 0.974871 0.989411
5 0.966969 0.972894 0.997427
6 0.942064 0.994681 0.996529
7 0.932688 0.986505 0.991353
8 0.935591 0.966705 0.980186
9 0.969994 0.977517 0.931809
10 0.979783 0.956659 0.923954
11 0.987701 0.959434 0.961002
12 0.907483 0.986226 0.978658
13 0.940320 0.985458 0.967748
14 0.952916 0.992365 0.973929
now when I sparse it up it gives me...
hey.ix[5:8,0] = np.nan
hey.ix[6:10,1] = np.nan
hoo.ix[5:8,0] = np.nan
hoo.ix[6:10,1] = np.nan
hey_corr_sparse = rolling_corr(hey.dropna(),hoo.dropna(), 4)
hey_corr_sparse
Out[398]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 0.991273 0.992557 0.985773
4 0.953041 0.999411 0.958595
11 0.996801 0.998218 0.992538
12 0.994919 0.998656 0.995235
13 0.994899 0.997465 0.997950
14 0.971828 0.937512 0.994037
Chunks of data are missing; it looks like we only have results where dropna() can form a complete window across the dataframe.
I can solve the problem with an ugly iterative fudge as follows...
hey_corr_sparse = DataFrame(np.nan, index=hey.index,columns=hey.columns)
for i in hey_corr_sparse.columns:
hey_corr_sparse.ix[:,i] = rolling_corr(hey.ix[:,i].dropna(),hoo.ix[:,i].dropna(), 4)
hey_corr_sparse
Out[406]:
0 1 2
0 NaN NaN NaN
1 NaN NaN NaN
2 NaN NaN NaN
3 0.991273 0.992557 0.985773
4 0.953041 0.999411 0.958595
5 NaN 0.944246 0.961917
6 NaN NaN 0.941467
7 NaN NaN 0.963183
8 NaN NaN 0.980530
9 0.993865 NaN 0.984484
10 0.997691 NaN 0.998441
11 0.978982 0.991095 0.997462
12 0.914663 0.990844 0.998134
13 0.933355 0.995848 0.976262
14 0.971828 0.937512 0.994037
Does anyone in the community know if it is possible to make this an array function that gives this result? I've attempted to use .apply but drawn a blank. Is it even possible to .apply a function that works on two data structures (hey and hoo in this example)?
many thanks, LW
you can try this:
>>> def sparse_rolling_corr(ts, other, window):
... return rolling_corr(ts.dropna(), other[ts.name].dropna(), window).reindex_like(ts)
...
>>> hey.apply(sparse_rolling_corr, args=(hoo, 4))
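rolling_corr was removed from the pandas namespace in later versions. If you are on modern pandas, a sketch of the same per-column idea using the .rolling() API; reindex_like is what puts the sparse results back onto the original index:

def sparse_rolling_corr(ts, other, window):
    # drop the gaps in this column, correlate against the same-named
    # column of `other`, then reindex back to the original (gappy) index
    left = ts.dropna()
    right = other[ts.name].dropna()
    return left.rolling(window).corr(right).reindex_like(ts)

hey.apply(sparse_rolling_corr, args=(hoo, 4))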
I have a Pandas dataframe as below:
incomplete_df = pd.DataFrame({'event1': [1, 2, np.nan, 5, 6, np.nan, np.nan, 11, np.nan, 15],
                              'event2': [np.nan, 1, np.nan, 3, 4, 7, np.nan, 12, np.nan, 17],
                              'event3': [np.nan, np.nan, np.nan, np.nan, 6, 4, 9, np.nan, 3, np.nan]})
incomplete_df
event1 event2 event3
0 1 NaN NaN
1 2 1 NaN
2 NaN NaN NaN
3 5 3 NaN
4 6 4 6
5 NaN 7 4
6 NaN NaN 9
7 11 12 NaN
8 NaN NaN 3
9 15 17 NaN
I want to append a reason column that gives a standard text + the column name of the minimum value of that row. In other words, the desired output is:
event1 event2 event3 reason
0 1 NaN NaN 'Reason is event1'
1 2 1 NaN 'Reason is event2'
2 NaN NaN NaN 'Reason is None'
3 5 3 NaN 'Reason is event2'
4 6 4 6 'Reason is event2'
5 NaN 7 4 'Reason is event3'
6 NaN NaN 9 'Reason is event3'
7 11 12 NaN 'Reason is event1'
8 NaN NaN 3 'Reason is event3'
9 15 17 NaN 'Reason is event1'
I can do incomplete_df.apply(lambda x: min(x), axis=1), but this does not ignore NaNs and, more importantly, returns the value rather than the name of the corresponding column.
EDIT:
Having found out about the idxmin() function from EMS's answer, I timed the two solutions below:
timeit.repeat("incomplete_df.apply(lambda x: x.idxmin(), axis=1)", "from __main__ import incomplete_df", number=1000)
[0.35261858807214175, 0.32040155511039536, 0.3186818508661702]
timeit.repeat("incomplete_df.T.idxmin()", "from __main__ import incomplete_df", number=1000)
[0.17752145781657447, 0.1628651645393262, 0.15563708275042387]
It seems like the transpose approach is twice as fast.
incomplete_df['reason'] = "Reason is " + incomplete_df.T.idxmin()
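One caveat with the idxmin approach: for the all-NaN row 2, idxmin gives NaN (and recent pandas versions may warn or refuse on all-NaN rows), so the concatenation yields NaN rather than the literal 'Reason is None' from the desired output. A sketch of one way to get that text, using a hypothetical helper:

def reason_column(df):
    all_nan = df.isna().all(axis="columns")              # rows with no events at all
    idx = df.loc[~all_nan].idxmin(axis="columns")        # column name of each row's minimum
    return ("Reason is " + idx).reindex(df.index, fill_value="Reason is None")

incomplete_df['reason'] = reason_column(incomplete_df[['event1', 'event2', 'event3']])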
ely's answer transposes the dataframe but this is not necessary.
Use the argument axis="columns" instead:
incomplete_df['reason'] = "Reason is " + incomplete_df.idxmin(axis="columns")
This is arguably easier to understand and faster (tested on Python 3.10.2):