Detecting outliers in a Pandas dataframe using a rolling standard deviation - python

I have a DataFrame for a fast Fourier transformed signal.
There is one column for the frequency in Hz and another column for the corresponding amplitude.
I read a post from a couple of years ago showing that you can use a simple boolean mask to exclude, or keep only, the outliers in the final data frame that lie more than a few standard deviations from the mean.
df = pd.DataFrame({'Data':np.random.normal(size=200)}) # example dataset of normally distributed data.
df[~(np.abs(df.Data-df.Data.mean())>(3*df.Data.std()))] # or if you prefer the other way around
The problem is that my signal drops by several orders of magnitude (up to 10 000 times smaller) as the frequency increases up to 50 000 Hz. Therefore, I cannot use a function that only exports values above 3 standard deviations, because I would only pick up the "peak" outliers from the first 50 Hz.
Is there a way I can export outliers in my dataframe that are above 3 rolling standard deviations of a rolling mean instead?

This is maybe best illustrated with a quick example. Basically you're comparing your existing data to a new column that is the rolling mean plus three standard deviations, also on a rolling basis.
import pandas as pd
import numpy as np
np.random.seed(123)
df = pd.DataFrame({'Data':np.random.normal(size=200)})
# Create a few outliers (3 of them, at index locations 10, 55, 80)
df.iloc[[10, 55, 80]] = 40.
r = df.rolling(window=20) # Create a rolling object (no computation yet)
mps = r.mean() + 3. * r.std() # Combine a mean and stdev on that object
print(df[df.Data > mps.Data]) # Boolean filter
# Data
# 55 40.0
# 80 40.0
To add a new column filtering only to outliers, with NaN elsewhere:
df['Peaks'] = df['Data'].where(df.Data > mps.Data, np.nan)
print(df.iloc[50:60])
Data Peaks
50 -1.29409 NaN
51 -1.03879 NaN
52 1.74371 NaN
53 -0.79806 NaN
54 0.02968 NaN
55 40.00000 40.0
56 0.89071 NaN
57 1.75489 NaN
58 1.49564 NaN
59 1.06939 NaN
Here .where returns
An object of same shape as self and whose corresponding entries are
from self where cond is True and otherwise are from other.
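A minimal sketch of the same idea on a signal whose amplitude decays, closer to the FFT case in the question. The decaying noise, the spike positions, and the choice of a centered window with `min_periods` are assumptions made for illustration, not part of the original answer. Note that a single spike inflates the rolling standard deviation of its own window, so very short windows (roughly 9 points or fewer at 3 standard deviations) can never flag it.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 400
# Hypothetical stand-in for FFT amplitudes: noise whose scale falls by 3 decades
scale = np.logspace(0, -3, n)
df = pd.DataFrame({"Data": rng.normal(size=n) * scale})

# Plant three peaks, each 50x the local noise scale (assumed positions)
spike_idx = [50, 200, 350]
df.loc[spike_idx, "Data"] = 50.0 * scale[spike_idx]

# Rolling threshold: local mean plus three local standard deviations
roll = df["Data"].rolling(window=20, center=True, min_periods=10)
upper = roll.mean() + 3.0 * roll.std()
outliers = df[df["Data"] > upper]

# Global threshold for comparison: dominated by the large early values
global_cut = df["Data"].mean() + 3.0 * df["Data"].std()
global_outliers = df[df["Data"] > global_cut]

print(outliers.index.tolist())
print(global_outliers.index.tolist())
```

The rolling threshold follows the local amplitude, so the small late peak is flagged, while the global threshold misses it.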

Related

Pandas recursive cumulative sum?

I am having a bit of trouble getting some pandas code to work. My basic problem is given a set of transactions and a set of balances, I need to come up with "balancing transactions"; i.e. fake transactions (which will be tagged as such) that will make it so that the sum of transactions are equal to balances (ignore the fact that in most cases this isn't a good idea; it makes sense in the context I am working in, I promise!).
Sample data:
import pandas as pd
from io import StringIO
txn_data = StringIO(
"""Contract,Txndate,TxnAmount
"wer42134423",1/1/2014, 50
"wer42134423",1/2/2014, -10
"wer42134423",1/3/2014, 100
"wer42134423",1/4/2014, -50
"wer42134423",1/5/2014, -10
"wer42134423",1/6/2014, 20
"wer42134423",1/7/2014, 50
"wer42134423",1/8/2014, -70
"wer42134423",1/10/2014, 21
"wer42134423",1/11/2014, -3
"""
)
txns=pd.read_csv(txn_data,parse_dates=["Txndate"])
txns.head()
balance_data = StringIO(
"""Contract,Baldate,Amount
"wer42134423", 1/1/2014, 50
"wer42134423", 1/4/2014, 100
"wer42134423", 1/9/2014, 96
"wer42134423", 1/11/2014, 105
"""
)
balances=pd.read_csv(balance_data,parse_dates=["Baldate"])
txns["CumulativeSumofTxns"]=txns.groupby("Contract")["TxnAmount"].cumsum()
balances_merged=pd.merge_asof(balances,txns,by="Contract",left_on=["Baldate"],right_on=["Txndate"])
balances_merged.head()
I can do this fairly easily in Excel; I merge the cumulative sum of transactions onto my balance data, then just apply a fairly simple sum formula, and then everything can balance out.
However, I cannot for the life of me figure out how to do the same in Pandas (without manually iterating through each "cell", which would be horrendous for performance). After doing a lot of digging it almost seem like the expanding window function would do the trick, but I couldn't get that to work after multiple attempts with shifting and such. I think the problem is that every entry in my new column is dependent on entries for the same row (namely, the current balance and cumulative sum of transactions) and all the prior entries in the column (namely, all the prior balancing transactions). Any help appreciated!
If I understand correctly, is this what you want?
balances_merged['Cumulative Sum Balancing Transactions'] = balances_merged['Amount'] - balances_merged['CumulativeSumofTxns']
balances_merged['Balancing Transaction'] = balances_merged['Cumulative Sum Balancing Transactions'].diff()
balances_merged
Output:
Contract Baldate Amount Txndate TxnAmount CumulativeSumofTxns Cumulative Sum Balancing Transactions Balancing Transaction
0 wer42134423 2014-01-01 50 2014-01-01 50 50 0 NaN
1 wer42134423 2014-01-04 100 2014-01-04 -50 90 10 10.0
2 wer42134423 2014-01-09 96 2014-01-08 -70 80 16 6.0
3 wer42134423 2014-01-11 105 2014-01-11 -3 98 7 -9.0
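The same calculation can be sketched per contract, which matters once the table holds more than one contract, and with the first balancing transaction filled in instead of left as NaN. The second contract and the column names `Gap` and `BalancingTxn` are hypothetical and only serve the illustration.

```python
import pandas as pd

# First contract uses the merged values shown above; the second is made up
balances_merged = pd.DataFrame({
    "Contract": ["wer42134423"] * 4 + ["hypothetical-2"] * 2,
    "Amount": [50, 100, 96, 105, 10, 30],
    "CumulativeSumofTxns": [50, 90, 80, 98, 10, 25],
})

# Gap between the stated balance and the running transaction total
gap = balances_merged["Amount"] - balances_merged["CumulativeSumofTxns"]
balances_merged["Gap"] = gap

# Each balancing transaction is the change in the gap within a contract;
# the first row of each contract has no predecessor, so it takes the gap itself
balances_merged["BalancingTxn"] = (
    gap.groupby(balances_merged["Contract"]).diff().fillna(gap)
)

print(balances_merged)
```

Within each contract, the cumulative sum of `BalancingTxn` reproduces `Gap`, which is the "recursive" property the question asks for.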

How to calculate best-fit line for each row with NaN?

I have a dataset storing marathon segment splits (5K, 10K, ...) in seconds and identifiers (age, gender, country) as columns and individuals as rows. Each cell for a marathon segment split column may contain either a float (specifying the number of seconds required to reach the segment) or "NaN". A row may contain up to 4 NaN values. Here is some sample data:
Age M/F Country 5K 10K 15K 20K Half Official Time
2323 38 M CHI 1300.0 2568.0 3834.0 5107.0 5383.0 10727.0
2324 23 M USA 1286.0 2503.0 3729.0 4937.0 5194.0 10727.0
2325 36 M USA 1268.0 2519.0 3775.0 5036.0 5310.0 10727.0
2326 37 M POL 1234.0 2484.0 3723.0 4972.0 5244.0 10727.0
2327 32 M DEN NaN 2520.0 3782.0 5046.0 5319.0 10728.0
I intend to calculate a best fit line for marathon split times (using only the columns between "5K" to "Half") for each row with at least one NaN; from the best fit line for the row, I want to impute a data point to replace the NaN with.
From the sample data, I intend to calculate a best fit line for row 2327 only (using values 2520.0, 3782.0, 5046.0, and 5319.0). Using this best fit line for row 2327, I intend to replace the NaN 5K time with the predicted 5K time.
How can I calculate this best fit line for each row with NaN?
Thanks in advance.
I "extrapolated" a solution from this 2015 answer: https://stackoverflow.com/a/31340344/6366770 (pun intended). I am not sure whether pandas has reliable extrapolation methods as of 2021, so you might have to use scipy or other libraries.
When doing the extrapolation, I excluded the "Half" column. That's because the running distances of 5K, 10K, 15K and 20K are evenly spaced. It is literally a straight line if you exclude the half marathon column. But that doesn't mean that expected running times are linear. Obviously, as you run a longer distance your average pace per kilometer gets slower. But this "gets the job done" without getting too involved in an incredibly complex calculation.
Also, this is worth noting. Let's say that the first column was 1K instead of 5K. Then this method would fail. It only works because the distances are evenly spaced. If it was 1K, you would also have to use the data from the rows of the other runners, unless you were making calculations based off the kilometers in the column names themselves. Either way, this is an imperfect solution, but much better than pandas' interpolate. I linked another potential solution in the comments of tdy's answer.
import pandas as pd
from scipy import interpolate
# We focus on the four numeric columns from 5K to 20K and transpose the dataframe, since we are going horizontally across columns.
# The index must also be numeric, so we drop it; the values are added back against the original index later on.
df_extrap = df.iloc[:, 4:8].T.reset_index(drop=True)
# Create a scipy interpolation function to be called by a custom extrapolation function later on
def scipy_interpolate_func(s):
    s_no_nan = s.dropna()
    return interpolate.interp1d(s_no_nan.index.values, s_no_nan.values, kind='linear', bounds_error=False)
def my_extrapolate_func(scipy_interpolate_func, new_x):
    # Straight line through the first and last non-NaN points
    x1, x2 = scipy_interpolate_func.x[0], scipy_interpolate_func.x[-1]
    y1, y2 = scipy_interpolate_func.y[0], scipy_interpolate_func.y[-1]
    slope = (y2 - y1) / (x2 - x1)
    return y1 + slope * (new_x - x1)
# Concatenate the extrapolated columns and transpose back to the initial shape so they can be added to the original dataframe
s_extrapolated = pd.concat([pd.Series(my_extrapolate_func(scipy_interpolate_func(df_extrap[s]),
                                                          df_extrap[s].index.values),
                                      index=df_extrap[s].index) for s in df_extrap.columns], axis=1).T
cols = ['5K', '10K', '15K', '20K']
df[cols] = s_extrapolated
df
Out[1]:
index Age M/F Country 5K 10K 15K 20K Half \
0 2323 38 M CHI 1300.0 2569.0 3838.0 5107.0 5383.0
1 2324 23 M USA 1286.0 2503.0 3720.0 4937.0 5194.0
2 2325 36 M USA 1268.0 2524.0 3780.0 5036.0 5310.0
3 2326 37 M POL 1234.0 2480.0 3726.0 4972.0 5244.0
4 2327 32 M DEN 1257.0 2520.0 3783.0 5046.0 5319.0
Official Time
0 10727.0
1 10727.0
2 10727.0
3 10727.0
4 10728.0
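For comparison, here is a minimal sketch of a least-squares best-fit line per row, as the question literally asks, using `np.polyfit` against the split distances in kilometres and filling only the NaN cells. This is a different technique from the two-point line above. The `split_km` mapping (with 21.0975 km for the half marathon) and the helper name `fill_row` are assumptions introduced here, and only two rows of the sample data are reproduced.

```python
import numpy as np
import pandas as pd

# Assumed mapping from column name to distance in km
split_km = {"5K": 5.0, "10K": 10.0, "15K": 15.0, "20K": 20.0, "Half": 21.0975}
cols = list(split_km)
x = np.array([split_km[c] for c in cols])

# Two rows taken from the sample data in the question
df = pd.DataFrame(
    {"5K": [1300.0, np.nan], "10K": [2568.0, 2520.0], "15K": [3834.0, 3782.0],
     "20K": [5107.0, 5046.0], "Half": [5383.0, 5319.0]},
    index=[2323, 2327],
)

def fill_row(row):
    """Fit a degree-1 polynomial to the known splits and fill the NaNs."""
    y = row.to_numpy(dtype=float)
    missing = np.isnan(y)
    # Leave complete rows alone; a line needs at least two known points
    if not missing.any() or (~missing).sum() < 2:
        return row
    slope, intercept = np.polyfit(x[~missing], y[~missing], deg=1)
    filled = y.copy()
    filled[missing] = slope * x[missing] + intercept
    return pd.Series(filled, index=row.index)

df_filled = df.copy()
df_filled[cols] = df[cols].apply(fill_row, axis=1)
print(df_filled)
```

Unlike the extrapolation above, existing values are left untouched and the "Half" column contributes to the fit.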

Python - Pandas: how can I interpolate between values that grow exponentially?

I have a Pandas Series that contains the price evolution of a product (my country has high inflation), or say, the number of people infected with coronavirus in a certain country. The values in both of these datasets grow exponentially; that means that if you had something like [3, NaN, 27] you'd want to interpolate so that the missing value is filled with 9 in this case. I checked the interpolation method in the Pandas documentation but, unless I missed something, I didn't find anything about this type of interpolation.
I can do it manually, you just take the geometric mean, or in the case of more values, get the average growth rate by doing (final value/initial value)^(1/distance between them) and then multiply accordingly. But there's a lot of values to fill in in my Series, so how do I do this automatically? I guess I'm missing something since this seems to be something very basic.
Thank you.
You could take the logarithm of your series, interpolate linearly and then transform it back to your exponential scale.
import pandas as pd
import numpy as np
arr = np.exp(np.arange(1,10))
arr = pd.Series(arr)
arr[3] = None
0 2.718282
1 7.389056
2 20.085537
3 NaN
4 148.413159
5 403.428793
6 1096.633158
7 2980.957987
8 8103.083928
dtype: float64
arr = np.log(arr) # Transform according to assumed process.
arr = arr.interpolate('linear') # Interpolate.
np.exp(arr) # Invert previous transformation.
0 2.718282
1 7.389056
2 20.085537
3 54.598150
4 148.413159
5 403.428793
6 1096.633158
7 2980.957987
8 8103.083928
dtype: float64
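Applied to the [3, NaN, 27] example from the question, the same log-interpolate-exponentiate steps can be wrapped in a small helper. The function name `interpolate_geometric` is hypothetical, and the sketch assumes all known values are strictly positive, since the logarithm is undefined otherwise.

```python
import numpy as np
import pandas as pd

def interpolate_geometric(s: pd.Series) -> pd.Series:
    """Fill gaps assuming constant growth rate between known values.

    Assumes every non-missing value is strictly positive.
    """
    return np.exp(np.log(s).interpolate(method="linear"))

single_gap = interpolate_geometric(pd.Series([3.0, np.nan, 27.0]))
double_gap = interpolate_geometric(pd.Series([2.0, np.nan, np.nan, 16.0]))

print(single_gap.round(6).tolist())  # [3.0, 9.0, 27.0]
print(double_gap.round(6).tolist())  # [2.0, 4.0, 8.0, 16.0]
```

Linear interpolation in log space is the same as taking the geometric mean for a single gap, and the same as applying the average growth rate step by step for longer gaps.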

Inserting missing numbers in dataframe

I have a program that ideally measures the temperature every second. However, in reality this does not happen. Sometimes it skips a second, or it breaks down for 400 seconds and then decides to start recording again. This leaves gaps in my 2-by-n dataframe, where ideally n = 86400 (the number of seconds in a day). I want to apply some sort of moving/rolling average to it to get a nicer plot, but if I do that to the "raw" datafiles, the number of data points decreases. This is shown here, watch the x-axis. I know the "nice data" doesn't look nice yet; I'm just playing with some values.
So, I want to implement a data cleaning method, which adds data to the dataframe. I thought about it, but don't know how to implement it. I thought of it as follows:
If the index is not equal to the time, then we need to add a number, at time = index. If this gap is only 1 value, then the average of the previous number and the next number will do for me. But if it is bigger, say 100 seconds are missing, then a linear function needs to be made, which will increase or decrease the value steadily.
So I guess a training set could be like this:
index time temp
0 0 20.10
1 1 20.20
2 2 20.20
3 4 20.10
4 100 22.30
Here, I would like to get a value for index 3, time 3 and the values missing between time = 4 and time = 100. I'm sorry about my formatting skills, I hope it is clear.
How would I go about programming this?
Use merge with complete time column and then interpolate:
import numpy as np
import pandas as pd
# Create your table
time = np.array([e for e in np.arange(20) if np.random.uniform() > 0.6])
temp = np.random.uniform(20, 25, size=len(time))
temps = pd.DataFrame([time, temp]).T
temps.columns = ['time', 'temperature']
>>> temps
time temperature
0 4.0 21.662352
1 10.0 20.904659
2 15.0 20.345858
3 18.0 24.787389
4 19.0 20.719487
The above is a random table generated with missing time data.
# modify it
filled = pd.Series(np.arange(temps.iloc[0,0], temps.iloc[-1, 0]+1))
filled = filled.to_frame()
filled.columns = ['time'] # Create a fully filled time column
merged = pd.merge(filled, temps, on='time', how='left') # merge it with original, time without temperature will be null
merged.temperature = merged.temperature.interpolate() # fill nulls linearly.
# Alternatively, use reindex, this does the same thing.
final = temps.set_index('time').reindex(np.arange(temps.time.min(),temps.time.max()+1)).reset_index()
final.temperature = final.temperature.interpolate()
>>> merged # or final
time temperature
0 4.0 21.662352
1 5.0 21.536070
2 6.0 21.409788
3 7.0 21.283505
4 8.0 21.157223
5 9.0 21.030941
6 10.0 20.904659
7 11.0 20.792898
8 12.0 20.681138
9 13.0 20.569378
10 14.0 20.457618
11 15.0 20.345858
12 16.0 21.826368
13 17.0 23.306879
14 18.0 24.787389
15 19.0 20.719487
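As a sketch, here is the reindex variant applied to the small sample table from the question (times 0, 1, 2, 4, 100), so the result can be checked by hand. The variable names `temps` and `full` are chosen for illustration only.

```python
import pandas as pd

# Sample table from the question: time 3 and times 5..99 are missing
temps = pd.DataFrame({
    "time": [0, 1, 2, 4, 100],
    "temp": [20.10, 20.20, 20.20, 20.10, 22.30],
})

# Build the complete range of seconds and reindex onto it
all_seconds = range(int(temps["time"].min()), int(temps["time"].max()) + 1)
full = temps.set_index("time").reindex(all_seconds).rename_axis("time")

# Linear interpolation: a single gap becomes the average of its neighbours,
# a long gap becomes a straight ramp between the surrounding readings
full["temp"] = full["temp"].interpolate(method="linear")
full = full.reset_index()

print(full.head(6))
```

Because the reindexed rows are evenly spaced at one second, the default positional linear interpolation is also linear in time.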
First you can set the second values to actual time values as such:
df.index = pd.to_datetime(df['time'], unit='s')
After which you can use pandas' built-in time series operations to resample and fill in the missing values:
df = df.resample('s').interpolate('time')
Optionally, if you still want to do some smoothing you can use the following operation for that:
df.rolling(5, center=True, win_type='hann').mean()
Which will smooth with a 5 element wide Hanning window. Note: any window-based smoothing will cost you value points at the edges.
Now your dataframe will have datetimes (including date) as index. This is required for the resample method. If you want to lose the date, you can simply use:
df.index = df.index.time

How to check the type of missing data in python(randomly missing or not)?

I have a big amount of data with me(93 files, ~150mb each). The data is a time series, i.e, information about a given set of coordinates(3.3 million latitude-longitude values) is recorded and stored everyday for 93 days, and the whole data is broken up into 93 files respectively. Example of two such files:
Day 1:
lon lat A B day1
68.4 8.4 NaN 20 20
68.4 8.5 16 20 18
68.6 8.4 NaN NaN NaN
.
.
Day 2:
lon lat C D day2
68.4 8.4 NaN NaN NaN
68.4 8.5 24 25 24.5
68.6 8.4 NaN NaN NaN
.
.
I am interested in understanding the nature of the missing data in the columns 'day1', 'day2', 'day3', etc. For example, if the values missing in the concerned columns are evenly distributed among all the set of coordinates then the data is probably missing at random, but if the missing values are concentrated more in a particular set of coordinates then my data will become biased. Consider the way my data is divided into multiple files of large sizes and isn't in a very standard form to operate on making it harder to use some tools.
I am looking for a diagnostic tool or visualization in python that can check/show how the missing data is distributed over the set of coordinates so I can impute/ignore it appropriately.
Thanks.
P.S: This is the first time I am handling missing data so it would be great to see if there exists a workflow which people who do similar kind of work follow.
Assuming you read a file into a DataFrame named df, you can count the number of NaNs using:
df.isnull().sum()
It returns the number of NaNs per column.
You could also use:
df.isnull().sum(axis=1).value_counts()
This, on the other hand, sums the number of NaNs per row and then counts how many rows have no NaNs, 1 NaN, 2 NaNs and so on.
Regarding working with files of this size: to speed up loading and processing the data, I recommend using Dask and converting your files to another format, preferably parquet, so that you can read and write them in parallel.
You could easily recreate the function above in Dask like this:
from dask import dataframe as dd
dd.read_parquet(file_path).isnull().sum().compute()
Answering the comment question:
Use .loc to slice your dataframe; in the code below I select all rows (:) and two columns, ['col1', 'col2'].
df.loc[:, ['col1', 'col2']].isnull().sum(axis=1).value_counts()
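To see how the missing values are spread over the coordinates rather than just how many there are, one option is to merge the per-day columns on (lon, lat) and compute the fraction of missing days per coordinate. The sketch below uses the three sample rows from the question; the frame names and the `missing_frac` column are assumptions for illustration, and with 93 real files the merge would be repeated or replaced by a concat.

```python
import numpy as np
import pandas as pd

# Per-day frames reduced to the coordinate keys and the day column
day1 = pd.DataFrame({"lon": [68.4, 68.4, 68.6], "lat": [8.4, 8.5, 8.4],
                     "day1": [20.0, 18.0, np.nan]})
day2 = pd.DataFrame({"lon": [68.4, 68.4, 68.6], "lat": [8.4, 8.5, 8.4],
                     "day2": [np.nan, 24.5, np.nan]})

merged = day1.merge(day2, on=["lon", "lat"], how="outer")
day_cols = ["day1", "day2"]

# Fraction of days with no value at each coordinate (0 = never missing)
merged["missing_frac"] = merged[day_cols].isna().mean(axis=1)

# Coordinates sorted from most to least affected
print(merged.sort_values("missing_frac", ascending=False))
```

A roughly uniform `missing_frac` across coordinates is consistent with values missing at random, while clusters of high values point to location-dependent gaps; a scatter plot of lon against lat coloured by `missing_frac` would show this visually.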
