My code detects outliers in a time series. What I want to do is to replace the outliers in the first dataframe column with the previous value which is not an outlier.
This code just detects outliers, creating a boolean array where:
True means that a value in the dataframe is an outlier
False means that a value in the dataframe is not an outlier
import pandas as pd
import numpy as np

series = pd.read_csv('horario_completo.csv', header=None, squeeze=True)
df = pd.DataFrame(series)
consumos = df.iloc[:, 0]
df['rolling_median'] = consumos.rolling(window=48, center=True).median().fillna(method='bfill').fillna(method='ffill')
threshold = 50
difference = np.abs(consumos - df['rolling_median'])
outlier = difference > threshold
Up to this point, everything works.
The next step I have thought of is to create a mask to replace the True values with the previous value of the same column (if this were possible, it would be much faster than making a loop).
I'll try to explain it with a little example:
This is what I have:
index consumo
0 54
1 67
2 98
index outlier
0 False
1 False
2 True
And this is what I want to do:
index consumo
0 54
1 67
2 67
I think I should create a mask like this:
df.mask(outlier, df.columns=[[0]][i-1],axis=1)
Obviously this IS NOT the way to write it. It's just an explanation of how I think it could be done (I'm talking about the [i-1]).
It seems you need shift:
consumo = consumo.mask(outlier, consumo.shift())
print (consumo)
0 54.0
1 67.0
2 67.0
Name: consumo, dtype: float64
Last, if all values are ints, add astype:
consumo = consumo.mask(outlier, consumo.shift()).astype(int)
print (consumo)
0 54
1 67
2 67
Name: consumo, dtype: int32
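If several consecutive values can be flagged as outliers, shift() will copy in a value that is itself an outlier. One way around that (a small sketch, not part of the answer above, reusing the consumo and outlier objects) is to mask the outliers with NaN and forward-fill, so every outlier takes the most recent non-outlier value:
# sketch: assumes consumo and outlier are the Series built above,
# and that the very first value is not an outlier
consumo_clean = consumo.mask(outlier)       # flagged values become NaN
consumo_clean = consumo_clean.ffill()       # each NaN takes the last non-outlier value
consumo_clean = consumo_clean.astype(int)   # optional, if all values are ints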
I'm writing a function for a special case of row-wise subtraction in pandas.
First the user should be able to specify rows either by regex (e.g. "_BL[0-9]+") or by regular index, e.g. every 6th row
Then we must subtract every matching row from rows preceding it, but not past another match
[Optionally] Drop selected rows
Column to match on should be user-defined by either index or label
For example if:
Samples           var1   var1
something           10     20
something           20     30
something           40     30
some_BL20_thing    100    100
something           50     70
something           90    100
some_BL10_thing    100     10
Expected output should be:
Samples       var1   var1
something      -90    -80
something      -80    -70
something      -60    -70
something      -50     60
something      -10     90
My current (incomplete) implementation relies heavily on looping:
import re
import pandas as pd

def subtract_blanks(data: pd.DataFrame, num_samples: int) -> pd.DataFrame:
    '''
    Accepts a data dataframe and a mod int and
    subtracts each blank from all mod preceding samples
    '''
    expr = re.compile(r'(_BL[0-9]{1})')
    data_start = 6  # position of the first numeric column in the real data
    output = data.copy(deep=True)
    for idx, row in output.iterrows():
        if re.search(expr, row['Samples']):
            for i in range(1, num_samples + 1):
                output.iloc[idx - i, data_start:] = output.iloc[idx - i, data_start:] - row.iloc[data_start:]
    return output
Is there a better way of doing this? This implementation seems pretty ugly. I've also considered splitting the DataFrame into chunks and operating on them instead.
Code
# Create boolean mask for matching rows
# m = np.arange(len(df)) % 6 == 5 # for index match
m = df['Samples'].str.contains(r'_BL\d+') # for regex match
# mask the values and backfill to propagate the row
# values corresponding to match in backward direction
df['var1'] = df['var1'] - df['var1'].mask(~m).bfill()
# Delete the matching rows
df = df[~m].copy()
Samples var1 var1
0 something -90.0 -80.0
1 something -80.0 -70.0
2 something -60.0 -70.0
4 something -50.0 60.0
5 something -10.0 90.0
Note: The core logic is specified in the code, so I'll leave the function implementation up to the OP.
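Purely as an illustration, a wrapper around that core logic could look like the sketch below; the signature (pattern, match_col, drop) and the positional handling of the value columns are my assumptions, not part of the answer:
import numpy as np
import pandas as pd

def subtract_blanks(data: pd.DataFrame,
                    pattern: str = r'_BL\d+',
                    match_col: str = 'Samples',
                    drop: bool = True) -> pd.DataFrame:
    '''Subtract each matching ("blank") row from the rows above it,
    without reaching past the previous match, then optionally drop the matches.'''
    out = data.copy()
    m = out[match_col].str.contains(pattern)    # rows matching the regex
    value_cols = out.columns != match_col       # every other column holds data
    vals = out.loc[:, value_cols].to_numpy(dtype=float)
    # keep only the matching rows, then backfill so each row sees the next match below it
    blanks = pd.DataFrame(np.where(m.to_numpy()[:, None], vals, np.nan)).bfill().to_numpy()
    out.loc[:, value_cols] = vals - blanks
    return out.loc[~m].copy() if drop else out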
I am trying to fill missing values in a subset of rows. I am using inplace=True in fillna(), but it is not working in a Jupyter notebook. You can see the attached picture showing NaN in the first 2 rows of the Surface column. I am not sure why.
I have to do the following for it to work. Why? Thank you for your help.
data.loc[mark,'Surface']=data.loc[mark,'Surface'].fillna(value='TEST')
Here is my code:
mark=(data['Pad']==51) | (data['Pad']==52) | (data['Pad']==53) | (data['Pad']==54) | (data['Pad']==55)
data.loc[mark,'Surface'].fillna(value='TEST',inplace=True)
This one is working:
data.loc[mark,'Surface']=data.loc[mark,'Surface'].fillna(value='TEST')
The main issue you're bumping into here is that pandas does not have very explicit view vs copy rules. Your result indicates to me that the issue here is that .loc is returning a copy instead of a view. While pandas does try to return a view from .loc, there are a decent number of caveats.
After playing around a little, it seems that using a boolean/positional index mask returns a copy; you can verify this with the private _is_view attribute:
import pandas as pd
import numpy as np
df = pd.DataFrame({"Pad": range(40, 60), "Surface": np.nan})
print(df)
Pad Surface
0 40 NaN
1 41 NaN
2 42 NaN
. ... ...
19 59 NaN
# Create masks
bool_mask = df["Pad"].isin(range(51, 56))
positional_mask = np.where(bool_mask)[0]
# Check `_is_view` after simple .loc:
>>> df.loc[bool_mask, "Surface"]._is_view
False
>>> df.loc[positional_mask, "Surface"]._is_view
False
So neither of the approaches above return a "view" of the original data, which is why performing an inplace operation does not change the original dataframe. In order to return a view from .loc you will need to use a slice as your row-index.
>>> df.loc[10:15, "Surface"]._is_view
True
Now this still won't resolve your issue, because the value you're filling NaN with may or may not change the dtype of the "Surface" column. In the example I have set up, "Surface" has a float64 dtype, and by filling in NaN with the value "Test" you are forcing that dtype to change, which is incompatible with the original dataframe. If your "Surface" column is an object dtype, then you don't need to worry about this.
>>> df.dtypes
Pad int64
Surface float64
# this does not work because "Test" is incompatible with float64 dtype
>>> df.loc[10:15, "Surface"].fillna("Test", inplace=True)
# this works because 0.9 is an appropriate value for a float64 dtype
>>> df.loc[10:15, "Surface"].fillna(0.9, inplace=True)
>>> print(df)
Pad Surface
.. ... ...
8 48 NaN
9 49 NaN
10 50 0.9
11 51 0.9
12 52 0.9
13 53 0.9
14 54 0.9
15 55 0.9
16 56 NaN
17 57 NaN
.. ... ...
TLDR; don't rely on inplace in pandas in general. In the bulk of its operations it still creates a copy of the underlying data and then attempts to replace the original source with the new copy. Pandas is not memory efficient, so if you're worried about memory performance you may want to switch to something designed to be zero-copy from the ground up, like Vaex, instead of trying to go through pandas.
Your approach of assigning the slice of the dataframe is the most appropriate and will ensure you receive the correct result of updating the dataframe as "inplace" as possible:
>>> df.loc[bool_mask, "Surface"] = df.loc[bool_mask, "Surface"].fillna("Test")
I have a pandas dataframe like below:
Coordinate
1 (1150.0,1760.0)
28 (1260.0,1910.0)
6 (1030.0,2070.0)
12 (1170.0,2300.0)
9 (790.0,2260.0)
5 (750.0,2030.0)
26 (490.0,2130.0)
29 (360.0,1980.0)
3 (40.0,2090.0)
2 (630.0,1660.0)
20 (590.0,1390.0)
Now, I want to create a new column 'dotProduct' by applying the formula
np.dot((b-a), (b-c)), where b is the Coordinate (1260.0, 1910.0) for index 28, a is the Coordinate of the previous row (index 1), and c is the Coordinate of the next row (index 6, i.e. (1030.0, 2070.0)). The calculated product is for row 2 (the row with index 28). So, in a way, I have to get the previous row value and the next value too. I have to calculate this for the entire 'Coordinate' column. I am quite new to pandas, hence still on the learning path. Please guide me a bit.
Thanks a lot for the help.
I assume that your 'Coordinate' column elements are already tuples of float values.
# Convert elements of 'Coordinate' into numpy array
df.Coordinate = df.Coordinate.apply(np.array)
# Subtract +/- 1 shifted values from original 'Coordinate'
a = df.Coordinate - df.Coordinate.shift(1)
b = df.Coordinate - df.Coordinate.shift(-1)
# take row-wise dot product based on the arrays a, b
df['dotProduct'] = [np.dot(x, y) for x, y in zip(a, b)]
# make 'Coordinate' tuple again (if you want)
df.Coordinate = df.Coordinate.apply(tuple)
Now I get this as df:
Coordinate dotProduct
1 (1150.0, 1760.0) NaN
28 (1260.0, 1910.0) 1300.0
6 (1030.0, 2070.0) -4600.0
12 (1170.0, 2300.0) 62400.0
9 (790.0, 2260.0) -24400.0
5 (750.0, 2030.0) 12600.0
26 (490.0, 2130.0) -18800.0
29 (360.0, 1980.0) -25100.0
3 (40.0, 2090.0) 236100.0
2 (630.0, 1660.0) -92500.0
20 (590.0, 1390.0) NaN
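If the frame is large, the zip loop can also be replaced by a fully vectorized variant (a sketch under the same assumption that 'Coordinate' holds numeric tuples; it reproduces the dotProduct column above):
import numpy as np

# stack the (x, y) tuples into an (n, 2) float array
pts = np.vstack(df['Coordinate'].to_numpy()).astype(float)

# differences to the previous (b - a) and the next (b - c) point
prev_diff = pts[1:-1] - pts[:-2]
next_diff = pts[1:-1] - pts[2:]

# row-wise dot products; the first and last rows have no neighbours
dots = np.full(len(df), np.nan)
dots[1:-1] = (prev_diff * next_diff).sum(axis=1)
df['dotProduct'] = dots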
I have a huge amount of points in my dataframe, so I would want to drop some of them (ideally keeping the mean values).
e.g. currently I have
date calltime
0 1491928756414930 4643
1 1491928756419607 166
2 1491928756419790 120
3 1491928756419927 142
4 1491928756420083 121
5 1491928756420217 109
6 1491928756420409 52
7 1491928756420476 105
8 1491928756420605 35
9 1491928756420654 120
10 1491928756420787 105
11 1491928756420907 93
12 1491928756421013 37
13 1491928756421062 112
14 1491928756421187 41
Is there any way to drop some amount of items, based on sampling?
To give more details: my problem is the number of values in very close intervals, e.g. 1491928756421062 and 1491928756421187.
So I have a chart like the one attached.
And instead I wanted to somehow have a mean value for those close intervals. But maybe grouped by a second...
I would use sample(), but as you said it selects randomly. If you want to take a sample according to some logic, for instance only keeping rows whose value satisfies mean * .9 < value < mean * 1.1, you can try the following code. Actually, it all depends on your sampling strategy.
As an example, something like this could be done.
test.csv:
1491928756414930,4643
1491928756419607,166
1491928756419790,120
1491928756419927,142
1491928756420083,121
1491928756420217,109
1491928756420409,52
1491928756420476,105
1491928756420605,35
1491928756420654,120
1491928756420787,105
1491928756420907,93
1491928756421013,37
1491928756421062,112
1491928756421187,41
sampling:
import pandas as pd

df = pd.read_csv("test.csv", sep=",", header=None)
mean = df[1].mean()
my_sample = df[(mean * .90 < df[1]) & (df[1] < mean * 1.10)]
You're looking for resample
df.set_index(pd.to_datetime(df.date, unit='us')).calltime.resample('s').mean()  # date holds epoch microseconds
This is a more complete example
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

tidx = pd.date_range('2000-01-01', periods=10000, freq='10ms')
df = pd.DataFrame(dict(calltime=np.random.randint(200, size=len(tidx))), tidx)

fig, axes = plt.subplots(2, figsize=(25, 10))
df.plot(ax=axes[0])
df.resample('s').mean().plot(ax=axes[1])
fig.tight_layout()
I have a question similar to this and this. The difference is that I have to select row by position, as I do not know the index.
I want to do something like df.iloc[0, 'COL_NAME'] = x, but iloc does not allow this kind of access. If I do df.iloc[0]['COL_NAME'] = x the warning about chained indexing appears.
For mixed position and index, use .ix (note that .ix has since been deprecated and removed in newer pandas versions). BUT you need to make sure that your index is not of integer type, otherwise it will cause confusion.
df.ix[0, 'COL_NAME'] = x
Update:
Alternatively, try
df.iloc[0, df.columns.get_loc('COL_NAME')] = x
Example:
import pandas as pd
import numpy as np
# your data
# ========================
np.random.seed(0)
df = pd.DataFrame(np.random.randn(10, 2), columns=['col1', 'col2'], index=np.random.randint(1,100,10)).sort_index()
print(df)
col1 col2
10 1.7641 0.4002
24 0.1440 1.4543
29 0.3131 -0.8541
32 0.9501 -0.1514
33 1.8676 -0.9773
36 0.7610 0.1217
56 1.4941 -0.2052
58 0.9787 2.2409
75 -0.1032 0.4106
76 0.4439 0.3337
# .iloc with get_loc
# ===================================
df.iloc[0, df.columns.get_loc('col2')] = 100
df
col1 col2
10 1.7641 100.0000
24 0.1440 1.4543
29 0.3131 -0.8541
32 0.9501 -0.1514
33 1.8676 -0.9773
36 0.7610 0.1217
56 1.4941 -0.2052
58 0.9787 2.2409
75 -0.1032 0.4106
76 0.4439 0.3337
One thing I would add here is that the at function on a dataframe is much faster particularly if you are doing a lot of assignments of individual (not slice) values.
df.at[index, 'col_name'] = x
In my experience I have gotten a 20x speedup. Here is a write-up that is in Spanish but still gives an impression of what's going on.
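As a rough way to check this claim on your own data (a small sketch with made-up sizes; absolute numbers will vary by machine and pandas version):
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randn(1000, 2), columns=['col1', 'col2'])

def set_with_loc():
    df.loc[500, 'col1'] = 1.0

def set_with_at():
    df.at[500, 'col1'] = 1.0

print(timeit.timeit(set_with_loc, number=10000))  # label-based .loc assignment
print(timeit.timeit(set_with_at, number=10000))   # .at skips most of the indexing overhead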
If you know the position, why not just get the index from that?
Then use .loc:
df.loc[index, 'COL_NAME'] = x
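For example (a sketch; position and x are placeholder names), the label can be looked up from the integer position first:
position = 0                    # known integer position of the row
index = df.index[position]      # corresponding index label
df.loc[index, 'COL_NAME'] = x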
You can use:
df.set_value('Row_index', 'Column_name', value)
set_value is ~100 times faster than the .ix method. It is also better than using df['Row_index']['Column_name'] = value.
But since set_value is deprecated now, .iat/.at are good replacements.
For example, if we have this DataFrame:
A B C
0 1 8 4
1 3 9 6
2 22 33 52
if we want to modify the value of the cell [0,"A"] we can do
df.iat[0,0] = 2
or df.at[0,'A'] = 2
Another way is to assign a column value for a given row based on the index position of the row; the index position always starts with zero, and the last index position is the length of the dataframe minus one:
df["COL_NAME"].iloc[0]=x
To modify the value in a cell at the intersection of row "r" (in column "A") and column "C"
retrieve the index of the row "r" in column "A"
i = df[ df['A']=='r' ].index.values[0]
modify the value in the desired column "C"
df.loc[i,"C"]="newValue"
Note: before doing this, be sure to reset the index of the rows, to have a clean index list!
df=df.reset_index(drop=True)
Another way is to get the row index and then use df.loc or df.at.
# get row index 'label' from row number 'irow'
label = df.index.values[irow]
df.at[label, 'COL_NAME'] = x
Extending Jianxun's answer, using the set_value method in pandas. It sets a value for a column at a given index.
From the pandas documentation:
DataFrame.set_value(index, col, value)
To set value at particular index for a column, do:
df.set_value(index, 'COL_NAME', x)
Hope it helps.