Remove outliers while preserving the timestamps in dataframe - python

I have data in a dataframe in the format shown below:
metric timestamp cas_pre fl_rat ...
0 2017-04-06 11:25:00 687.982849 1627.040283 ...
1 2017-04-06 11:30:00 693.427673 1506.217285 ...
2 2017-04-06 11:35:00 692.686310 1537.114807 ...
....
45 2017-04-06 11:35:00 51987.427673 1537.114807 ...
....
101003 2017-04-06 11:35:00 692.686310 1537.114807 ...
It's very clear that row 45 needs to be eliminated since it's an anomaly. There are multiple columns and quite a few rows (100,000+). Now I want to remove the outliers from this, and have been using the code:
drop_df = df.drop(columns=['timestamp'])
drop_df = drop_df[(np.abs(stats.zscore(drop_df)) < 3).all(axis=1)]
However, this gives me the data without the timestamps, because I cannot include the timestamps in the z-score calculation. I want to keep the timestamp of every row that survives the filter, but that association is lost once I filter with the z-score. The result I want is shown below:
metric timestamp cas_pre fl_rat ...
0 2017-04-06 11:25:00 687.982849 1627.040283 ...
1 2017-04-06 11:30:00 693.427673 1506.217285 ...
2 2017-04-06 11:35:00 692.686310 1537.114807 ...
....
101003 2017-04-06 11:35:00 692.686310 1537.114807 ...
How can I achieve that?

It's probably better to explicitly set which columns to use for the z-score calculation:
cols = ['cas_pre', 'fl_rat', ...]
df = df[(np.abs(stats.zscore(df[cols])) < 3).all(axis=1)]
Alternatively, you can drop the timestamp column only in the input to the z-score calculation:
drop_df = df.drop(columns=['timestamp'])
df = df[(np.abs(stats.zscore(drop_df)) < 3).all(axis=1)]
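Both variants work because the boolean mask is computed on the numeric columns only and then applied to the full dataframe, so the timestamp column is carried along. A minimal, self-contained sketch of that idea follows. The synthetic data, the column list and the injected anomaly are assumptions for illustration, and the z-score is computed by hand with pandas (population standard deviation, `ddof=0`, which is the default of `scipy.stats.zscore`) so the example has no SciPy dependency.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data (assumed values, not from the question).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "timestamp": pd.date_range("2017-04-06 11:25", periods=200,
                               freq=pd.Timedelta(minutes=5)),
    "cas_pre": rng.normal(690, 3, size=200),
    "fl_rat": rng.normal(1550, 50, size=200),
})
df.loc[45, "cas_pre"] = 51987.427673  # inject an anomaly like row 45

cols = ["cas_pre", "fl_rat"]  # numeric columns used for the z-score

# Hand-rolled z-score with ddof=0 (the scipy.stats.zscore default).
z = (df[cols] - df[cols].mean()) / df[cols].std(ddof=0)

# Build the mask from the numeric columns, then index the full frame with it.
mask = (z.abs() < 3).all(axis=1)
filtered = df[mask]

print(filtered.head())
```

Because `filtered` is a row selection of the original frame, each remaining row keeps its own timestamp and its original index label.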

Related

python masking each day in dataframe

I have to compute a daily sum on a dataframe, but only for days on which at least 70% of the data is not NaN. Days that fall below that threshold must not be taken into account. Is there a way to create such a mask? My dataframe holds more than 17 years of hourly data.
My data looks something like this:
clear skies all skies Lab
2015-02-26 13:00:00 597.5259 376.1830 307.62
2015-02-26 14:00:00 461.2014 244.0453 199.94
2015-02-26 15:00:00 283.9003 166.5772 107.84
2015-02-26 16:00:00 93.5099 50.7761 23.27
2015-02-26 17:00:00 1.1559 0.2784 0.91
... ... ...
2015-12-05 07:00:00 95.0285 29.1006 45.23
2015-12-05 08:00:00 241.8822 120.1049 113.41
2015-12-05 09:00:00 363.8040 196.0568 244.78
2015-12-05 10:00:00 438.2264 274.3733 461.28
2015-12-05 11:00:00 456.3396 330.6650 447.15
If I group by day and aggregate, then there is no way to know whether any day was missing data; such days will have lower sums and will therefore lower my monthly means.
As said in the comments, use groupby to group the data by date and then write an appropriate selection. This example sums every day (assuming regularly spaced data, 24 points per day) on which fewer than 50% of the entries are NaN:
import pandas as pd
import numpy as np
# create a date range
date_rng = pd.date_range(start='1/1/2018', end='1/1/2021', freq='H')
# create random data
df = pd.DataFrame({"data": np.random.randint(0, 100, size=len(date_rng)).astype(float)}, index=date_rng)
# set some values to nan (use .loc to avoid chained assignment)
df.loc[df["data"] > 50, "data"] = np.nan
# looks like this
df.head(20)
# sum everything where less than 50% are nan
df.groupby(df.index.date).sum()[df.isna().groupby(df.index.date).sum() < 12]
Example output:
data
2018-01-01 NaN
2018-01-02 NaN
2018-01-03 487.0
2018-01-04 NaN
2018-01-05 421.0
... ...
2020-12-28 NaN
2020-12-29 NaN
2020-12-30 NaN
2020-12-31 392.0
2021-01-01 0.0
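The same pattern can be adapted to the 70% threshold from the question. The sketch below is one possible way to do it, under the assumption that a complete day has 24 hourly samples; the tiny three-day frame and the names `valid_fraction` and `daily_sum` are made up for illustration. It divides the count of non-NaN values by the expected 24, so hours that are missing from the index altogether also count against the day.

```python
import numpy as np
import pandas as pd

# Three days of hourly data, all ones, so each daily sum equals the valid count.
idx = pd.date_range("2018-01-01", periods=3 * 24, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"data": np.ones(len(idx))}, index=idx)
df.iloc[0:6, 0] = np.nan    # day 1: 18 of 24 valid (75%), should be kept
df.iloc[24:36, 0] = np.nan  # day 2: 12 of 24 valid (50%), should be masked

expected_per_day = 24  # assumption: regular hourly data
grouped = df.groupby(df.index.date)

# Fraction of valid (non-NaN) samples per day.
valid_fraction = grouped.count() / expected_per_day

# Keep the daily sum only where at least 70% of the day is valid.
daily_sum = grouped.sum().where(valid_fraction >= 0.7)

print(daily_sum)
```

Masked days come out as NaN, and `mean()` skips NaN by default, so a later monthly mean would ignore them rather than be pulled down by them.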
An alternative solution - you may find it useful & flexible:
# pip install convtools
from convtools import conversion as c
total_number = c.ReduceFuncs.Count()
total_not_none = c.ReduceFuncs.Count(where=c.item("amount").is_not(None))
total_sum = c.ReduceFuncs.Sum(c.item("amount"))
input_data = [] # e.g. iterable of dicts
converter = (
    c.group_by(
        c.item("key1"),
        c.item("key2"),
    )
    .aggregate(
        {
            "key1": c.item("key1"),
            "key2": c.item("key2"),
            "sum_if_70": c.if_(
                total_not_none / total_number < 0.7,
                None,
                total_sum,
            ),
        }
    )
    .gen_converter(
        debug=False
    )  # install black and set to True to see the generated ad-hoc code
)
result = converter(input_data)

Python Pandas can do these tasks?

I have a time series dataframe with hourly data covering eight years (2013-2020). There are nine zones, and each zone has two columns ("GEN" and "LOAD"), as follows:
A ZONE B ZONE ... G ZONE H ZONE I ZONE
date_time GEN LOAD GEN LOAD ... LOAD GEN LOAD GEN LOAD
2013-01-01 00:00:00 725.7 5,859.5 312.2 3,194.7 ... 77.1 706.0 227.1 495.0 861.9
2013-01-01 01:00:00 436.2 450.5 248.0 198.0 ... 865.5 240.7 107.9 640.5 767.3
2013-01-01 02:00:00 464.5 160.2 144.2 068.3 ... 738.7 044.7 32.7 509.3 700.4
2013-01-01 03:00:00 169.9 733.8 268.1 869.5 ... 671.7 649.4 951.3 626.8 652.1
2013-01-01 04:00:00 145.4 553.4 280.2 872.8 ... 761.5 561.0 912.9 552.1 637.3
... ... ... ... ... ... ... ... ... ... ... ...
2020-12-31 19:00:00 450.9 951.7 371.4 516.3 ... 461.7 808.9 471.4 983.7 447.8
2020-12-31 20:00:00 553.0 936.5 848.7 233.9 ... 397.3 978.3 404.3 490.9 233.0
2020-12-31 21:00:00 458.6 735.6 716.8 121.7 ... 385.1 808.0 192.0 131.5 70.1
2020-12-31 22:00:00 515.8 651.6 693.5 142.4 ... 291.4 826.1 16.8 591.9 863.2
2020-12-31 23:00:00 218.6 293.4 448.2 14.2 ... 340.6 435.0 897.4 622.5 768.3
What I want is the following:
1- Detect the outliers in each column, i.e. values that deviate by more than three times the standard deviation of that column, and put them in a new column named "A_gen_outliers" for outliers in the "GEN" column under "A ZONE", and "A_load_outliers" for outliers in the "LOAD" column under "A ZONE". This gives 18 new columns.
2- A new column holding the sum of the "GEN" columns.
3- A new column holding the sum of the "LOAD" columns.
4- For each "GEN" column, a new column that divides each cell by the maximum of that column for the same year: A_GEN_div = cell value / maximum value of the "GEN" column under "A ZONE" for that year. For example, 725.7/725.7 = 1 for the first cell, 436.2/725.7 for the second cell, and 218.6/553.0 for the last cell. The same applies to all "GEN" columns and also to the "LOAD" columns (proposed name "A_Load_div"). This gives 18 new columns.
The total number of new columns is 18 * 2 + 2.
Thanks in advance.
I think this might help. Note that this will keep the columns MultiIndex. Your points above seem to imply that you want to flatten your MultiIndex. If this is the case, you might want to look at this question.
1:
df.join((df - df.mean()).abs() > 3 * df.std(), rsuffix='_outlier')
2 and 3:
df.groupby(level=-1, axis=1).sum()
Note that it is not clear what the first level of the columns MultiIndex should be for this.
4:
maxima = df.resample('1Y').max()
maxima.index = maxima.index + pd.DateOffset(hours=23)
maxima = maxima.reindex(df.index, method='bfill')
df.join(df.divide(maxima), rsuffix='_div')
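For readers who prefer flat column names such as "A_gen_outliers", the four steps can also be sketched end to end as follows. This is only an illustration under stated assumptions: the two-zone random frame, the injected outlier and all variable names are made up, outliers are taken to be values more than three standard deviations from the column mean, and the yearly maximum is obtained with `groupby(df.index.year).transform("max")` instead of the resample approach above.

```python
import numpy as np
import pandas as pd

# Small synthetic frame: 48 hours across a year boundary, two zones (assumed).
idx = pd.date_range("2019-12-31 00:00", periods=48, freq=pd.Timedelta(hours=1))
cols = pd.MultiIndex.from_product([["A", "B"], ["GEN", "LOAD"]])
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.uniform(100, 900, size=(len(idx), 4)), index=idx, columns=cols)
df.iloc[5, 0] = 50000.0  # inject one obvious outlier into A/GEN

# 1: flag values more than three standard deviations from the column mean.
outliers = (df - df.mean()).abs() > 3 * df.std()
outliers.columns = [f"{zone}_{kind.lower()}_outliers" for zone, kind in outliers.columns]

# 2 and 3: sum all GEN columns and all LOAD columns (group on the second level).
totals = df.T.groupby(level=1).sum().T.add_suffix("_sum")

# 4: divide every cell by the maximum of its column within the same year.
yearly_max = df.groupby(df.index.year).transform("max")
div = df / yearly_max
div.columns = [f"{zone}_{kind}_div" for zone, kind in div.columns]

# Flatten the original columns and put everything side by side.
flat = df.copy()
flat.columns = [f"{zone}_{kind}" for zone, kind in flat.columns]
result = pd.concat([flat, outliers, totals, div], axis=1)

print(result.columns.tolist())
```

Transposing before grouping avoids `groupby(..., axis=1)`, which newer pandas versions deprecate.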

How to loop through a pandas grouped time series?

I have a dataframe like this:
datetime type d13C ... dayofyear week dmy
1 2018-01-05 15:22:30 air -8.88 ... 5 1 5-1-2018
2 2018-01-05 15:23:30 air -9.08 ... 5 1 5-1-2018
3 2018-01-05 15:24:30 air -10.08 ... 5 1 5-1-2018
4 2018-01-05 15:25:30 air -9.51 ... 5 1 5-1-2018
5 2018-01-05 15:26:30 air -9.61 ... 5 1 5-1-2018
... ... ... ... ... ... ...
341543 2018-12-17 12:42:30 air -9.99 ... 351 51 17-12-2018
341544 2018-12-17 12:43:30 air -9.53 ... 351 51 17-12-2018
341545 2018-12-17 12:44:30 air -9.54 ... 351 51 17-12-2018
341546 2018-12-17 12:45:30 air -9.93 ... 351 51 17-12-2018
341547 2018-12-17 12:46:30 air -9.66 ... 351 51 17-12-2018
Full data here: https://drive.google.com/file/d/1KmOwnpvrG2Edz1AlLyD0CKZlBpaFervM/view?usp=sharing
I'm plotting d13C column on the Y-axis and inverse total_co2 on the X and then fitting a regression line for each day in the data. I then filter out and store the dates I want depending on if the r^2 value of the regression line is > 0.8 like this:
import pandas as pd
from numpy.polynomial.polynomial import polyfit
import numpy as np
from scipy import stats
df = pd.read_csv('dataset.txt',
                 usecols=['datetime', 'type', 'total_co2', 'd13C', 'day', 'month', 'year', 'dayofyear', 'week', 'hour'],
                 dtype={'total_co2': np.float64, 'd13C': np.float64, 'day': str, 'month': str,
                        'year': str, 'week': str, 'hour': str, 'dayofyear': str})
# adding a full date column to make it easier to filter through the rows, i.e. each day
df['dmy'] = df['day'] + '-' + df['month'] + '-' + df['year']
# window18 = df[((df['year']=='2018'))] # selecting just the data from the year 2018
accepted_dates_list = []  # creating an empty list to store the dates that we're interested in
for d in df['dmy'].unique():  # this will pass through each day, the .unique() ensures that it doesn't go over the same days
    acceptable_date = {}  # creating a dictionary to store the valid dates
    period = df[df.dmy == d]  # defining each period from the dmy column
    p = (period['total_co2'])**-1
    q = period['d13C']
    c, m = polyfit(p, q, 1)  # intercept and gradient calculation of the regression line
    slope, intercept, r_value, p_value, std_err = stats.linregress(p, q)  # getting some statistical properties of the regression line
    if r_value**2 >= 0.8:
        acceptable_date['period'] = d  # populating the dictionary with the accepted dates and corresponding other values
        acceptable_date['r-squared'] = r_value**2
        acceptable_date['intercept'] = intercept
        accepted_dates_list.append(acceptable_date)  # sending the valid stuff in the dictionary to the list
    else:
        pass
accepted_dates18 = pd.DataFrame(accepted_dates_list)  # converting the list to a df
print(accepted_dates18)
But now I want to do the same thing, just over three day periods which I'm trying to select from the day of year column (unsure if this is the best way or not). For example, I would want to fit the regression line using all the rows with dayofyear=5, dayofyear=6, dayofyear=7, then for the next three days until the end of the data. There are some days missing, but essentially I just need to do this for every 3 days in the data.
The output dataframe I am then trying to get would have the list of the three day intervals with the r^2 >0.8, so anything like this that will show the valid date range:
Accepted dates
0 23-08-2018 - 25-08-2018
1 26-08-2018 - 28-08-2018
2 31-08-2018 - 02-09-2018
3 15-09-2018 - 17-09-2018
4 24-09-2018 - 26-09-2018
I'm not too sure what to do to iterate over every three days. Any help would go a long way, thanks!
Your code loops through a list of unique dates and filters the dataframe on each iteration.
Pandas implements this pattern with df.groupby(). It can be used to loop over the groups, or it can be combined with aggregations, function applications, and transformations. You can read more about it in the user guide. It can group according to any of the columns (or set of columns) in df, levels of the index, or any other exogenous list-like with the same length as df (we are grouping rows, but note it can also group columns). It even has implementations for the most common statistical aggregations like mean, std, and corr, among many others.
Now to your problem. You want not only the correlation but also the equation, so you do need to loop. And to get three-day groups you can use that dayofyear column with a twist.
Take this data
import io
import pandas as pd
fo = io.StringIO(
'''datetime,d13C
2018-01-05 15:22:30,-8.88
2018-01-05 15:23:30,-9.08
2018-01-06 15:24:30,-10.0
2018-01-06 15:25:30,-9.51
2018-01-07 15:26:30,-9.61
2018-01-07 15:27:30,-9.61
2018-01-08 15:28:30,-9.61
2018-01-08 15:29:30,-9.61
2018-01-09 15:26:30,-9.61
2018-01-09 15:27:30,-9.61
''')
df = pd.read_csv(fo)
df.datetime = pd.to_datetime(df.datetime)
fo.close()
With the code for grouping and looping
first_day = 5
days_to_group = 3
for doy, gdf in df.groupby((df.datetime.dt.dayofyear.sub(first_day) // days_to_group)
                           * days_to_group + first_day):
    print(gdf, '\n')
    print(doy, '\n')
Output
datetime d13C
0 2018-01-05 15:22:30 -8.88
1 2018-01-05 15:23:30 -9.08
2 2018-01-06 15:24:30 -10.00
3 2018-01-06 15:25:30 -9.51
4 2018-01-07 15:26:30 -9.61
5 2018-01-07 15:27:30 -9.61
5
datetime d13C
6 2018-01-08 15:28:30 -9.61
7 2018-01-08 15:29:30 -9.61
8 2018-01-09 15:26:30 -9.61
9 2018-01-09 15:27:30 -9.61
8
Now you can plug your code into this loop and get what you need.
PS
You can also use df.datetime.dt.floor('3d') as the grouper but I am not aware of how to control the first_day, so use it with caution.
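To make "plug your code into this loop" concrete, here is a rough sketch of the three-day grouping combined with the r^2 filter from the question. It is not the asker's pipeline: the data is synthetic (days 5 to 7 are built to lie on an exact line in 1/total_co2, later days are noise), the fit uses `np.polyfit` and `np.corrcoef` instead of `scipy.stats.linregress` to stay dependency-light, and names such as `group_start` and `accepted_dates` are illustrative.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: one row per hour for nine days.
rng = np.random.default_rng(0)
times = pd.date_range("2018-01-05", periods=9 * 24, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"datetime": times,
                   "total_co2": rng.uniform(400, 500, size=len(times))})
# Days 5-7 follow an exact line in 1/total_co2; later days are pure noise.
clean = df["datetime"].dt.dayofyear <= 7
df["d13C"] = np.where(clean, 6000 / df["total_co2"] - 25,
                      rng.normal(-10, 1, size=len(df)))

first_day = 5
days_to_group = 3
doy = df["datetime"].dt.dayofyear
# Label every row with the first day-of-year of its three-day block.
group_start = (doy - first_day) // days_to_group * days_to_group + first_day

accepted = []
for start, gdf in df.groupby(group_start):
    x = 1 / gdf["total_co2"]
    y = gdf["d13C"]
    slope, intercept = np.polyfit(x, y, 1)     # highest power first
    r_squared = np.corrcoef(x, y)[0, 1] ** 2   # r^2 of the linear fit
    if r_squared >= 0.8:
        first = gdf["datetime"].min().strftime("%d-%m-%Y")
        last = gdf["datetime"].max().strftime("%d-%m-%Y")
        accepted.append({"period": f"{first} - {last}",
                         "r-squared": r_squared,
                         "intercept": intercept})

accepted_dates = pd.DataFrame(accepted)
print(accepted_dates)
```

Note that the day-of-year arithmetic restarts every January, so data spanning several years would need the year as an additional grouping key.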
Here is one approach. As I understand it, the primary goal is to get from current observations (multiple per day) to a 3-day moving average. First, I created a smaller, simpler data set:
import pandas as pd
df = pd.DataFrame({'counter': [*range(100)],
'date': pd.date_range('2020-01-01', periods=100, freq='7H')})
df = df.set_index('date')
print(df.head())
counter
date
2020-01-01 00:00:00 0
2020-01-01 07:00:00 1
2020-01-01 14:00:00 2
2020-01-01 21:00:00 3
2020-01-02 04:00:00 4
Second, I re-sampled on a daily basis:
df2 = df['counter'].resample('1D').mean() # <-- called df2
print(df2.head())
date
2020-01-01 1.5
2020-01-02 5.0
2020-01-03 8.5
2020-01-04 12.0
2020-01-05 15.5
Freq: D, Name: counter, dtype: float64
Third, I computed mean value for a 3-day moving window:
print(df2.rolling(3).mean().head())
date
2020-01-01 NaN
2020-01-02 NaN
2020-01-03 5.0
2020-01-04 8.5
2020-01-05 12.0
Freq: D, Name: counter, dtype: float64
Seems like resample().mean() and rolling().mean() would be useful in this case.

Not able to use a key from a merged dataframe

I've got two dataframes that both have a date column and an emaX column. When I merge them I get the expected result of a single date column and two emaX columns. But when I try to access the date key from the merged dataframe, it returns a KeyError: date.
This is the function that returns the emaX (I have two, but they're nearly identical):
def av_get_ema_20():
    ti = TechIndicators(key=TOKEN, output_format="pandas")
    emaData20, meta_ema = ti.get_ema(symbol=SYMBOL, interval=INTERVAL, time_period=20, series_type=EMA_TYPE)
    ema20renamed = pd.DataFrame(emaData20)
    ema20renamed.rename(columns={'EMA': 'ema20'}, inplace=True)
    return ema20renamed
Then I merge the two returned dataframes:
mergedDF = pd.merge(av_get_ema_10(), av_get_ema_20(), on=["date"], how="inner")
# TEST LINE
print(mergedDF)
The dataframe that is printed out appears as I expected it to be:
ema10 ema20
date
2020-01-02 11:30:00 3226.5200 NaN
2020-01-02 12:30:00 3229.0927 NaN
2020-01-02 13:30:00 3232.0558 NaN
2020-01-02 14:30:00 3235.0839 NaN
2020-01-02 15:30:00 3239.1668 NaN
... ... ...
2020-03-26 11:30:00 2524.9545 2473.8551
2020-03-26 12:30:00 2533.1755 2483.0279
2020-03-26 13:30:00 2541.2982 2492.0586
2020-03-26 14:30:00 2551.0458 2501.8540
2020-03-26 15:30:00 2565.2866 2513.9983
But then when I attempt to use the merged dataframe (for example, iterating through it), I get KeyError: date:
for index, row in mergedDF.iterrows():
    print(row["date"], row["ema10"], row["ema20"])
Am I misinterpreting the dataframe in some way or is there something else I am supposed to do prior to using the merged set (including the date)? I'm at a loss here.
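Judging from the printed frame, where date sits on its own line below the column names, date appears to be the index of the merged dataframe rather than a column, which would explain the KeyError. The sketch below reproduces that situation with two small hand-made frames (the values are copied from the output above; the frame construction is an assumption, since the real frames come from the API) and shows two ways to reach the date.

```python
import pandas as pd

# Two frames indexed by a DatetimeIndex named "date", as the printout suggests.
dates = pd.date_range("2020-03-26 11:30", periods=3,
                      freq=pd.Timedelta(hours=1), name="date")
ema10 = pd.DataFrame({"ema10": [2524.9545, 2533.1755, 2541.2982]}, index=dates)
ema20 = pd.DataFrame({"ema20": [2473.8551, 2483.0279, 2492.0586]}, index=dates)

# Merging on an index level name keeps "date" as the index, not as a column.
merged = pd.merge(ema10, ema20, on=["date"], how="inner")

# Option 1: the first value yielded by iterrows() is the index label (the date).
rows_from_index = [(index, row["ema10"], row["ema20"])
                   for index, row in merged.iterrows()]

# Option 2: turn the index into a regular column, then row["date"] works.
flat = merged.reset_index()
rows_from_column = [(row["date"], row["ema10"], row["ema20"])
                    for _, row in flat.iterrows()]

print(rows_from_index[0])
```

Either option gives access to the date; `reset_index()` is the one to pick if later code expects a "date" column.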

Merge 2 data frames in pandas

I have 2 data frames: GPS coordinates
Time X Y Z
2013-06-01 00:00:00 13512.466575 -12220.845913 19279.970720
2013-06-01 00:00:00 -13529.778408 -14013.560399 -18060.112972
2013-06-01 00:00:00 25108.907276 8764.536182 1594.215305
2013-06-01 00:00:00 -8436.586675 -22468.562354 -11354.726511
2013-06-01 00:05:00 13559.288748 -11476.738832 19702.063737
2013-06-01 00:05:00 -13500.120049 -14702.564328 -17548.488127
2013-06-01 00:05:00 25128.357948 8883.802142 664.732379
2013-06-01 00:05:00 -8346.854582 -22878.993160 -10544.640975
and Glonass coordinates
Time X Y Z
2013-06-01 00:00:00 0.248752905273E+05 -0.557450976562E+04 -0.726176757812E+03
2013-06-01 00:15:00 0.148314306641E+05 0.510153710938E+04 0.201156157227E+05
2013-06-01 00:15:00 0.242346674805E+05 -0.562089208984E+04 0.561714257812E+04
2013-06-01 00:15:00 0.195601284180E+05 -0.122148081055E+05 -0.108823476562E+05
2013-06-01 00:15:00 0.336192968750E+04 -0.122589394531E+05 -0.220986958008E+05
and I need to merge them according to column Time - to get the coordinates of satellites from only the same time (I need all GPS coordinates and all Glonass coordinates from particular time), the result from above example should look like this:
Time X_gps Y_gps Z_gps X_glonass Y_glonass Z_glonass
0 2013-06-01 00:00:00 13512.466575 -12220.845913 19279.970720 0.248752905273E+05 -0.557450976562E+04 -0.726176757812E+03
1 2013-06-01 00:00:00 -13529.778408 -14013.560399 -18060.112972
2 2013-06-01 00:00:00 25108.907276 8764.536182 1594.215305
3 2013-06-01 00:00:00 -8436.586675 -22468.562354 -11354.726511
What I ended up doing is coord = pd.merge(d_gps, d_glonass, on = 'Time', how = 'inner', suffixes = ('_gps','_glonass')) but it copies glonass coordinates to fulfill empty spaces in data frame. What should I change to get the result I want?
I'm new to pandas so I really need your help.
After merging (I took the liberty of renaming the columns first), you can iterate over the columns, test for duplicated values, and set these to NaN. You can't set them to blank because the column dtype is float, and setting a blank string will raise an invalid literal error:
In [272]:
df1 = df1.rename(columns={'X':'X_glonass', 'Y':'Y_glonass', 'Z':'Z_glonass'})
df = df.rename(columns={'X':'X_gps', 'Y':'Y_gps', 'Z':'Z_gps'})
merged = df.merge(df1, on='Time')
In [278]:
for col in merged.columns[1:]:
    merged.loc[merged[col].duplicated(), col] = np.nan
merged
Out[278]:
Time X_gps Y_gps Z_gps X_glonass \
0 2013-06-01 13512.466575 -12220.845913 19279.970720 24875.290527
1 2013-06-01 -13529.778408 -14013.560399 -18060.112972 NaN
2 2013-06-01 25108.907276 8764.536182 1594.215305 NaN
3 2013-06-01 -8436.586675 -22468.562354 -11354.726511 NaN
Y_glonass Z_glonass
0 -5574.509766 -726.176758
1 NaN NaN
2 NaN NaN
3 NaN NaN
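A different way to get the same shape, sketched here as an alternative rather than as the method above, is to number the rows within each Time with `groupby(...).cumcount()` and merge on both Time and that counter, so that each Glonass row is paired with at most one GPS row and nothing is copied. The frames are trimmed to the X column and a few rows from the question, and the helper column `n` is a made-up name.

```python
import pandas as pd

# Trimmed sample from the question: only the X coordinate, a few rows each.
gps = pd.DataFrame({
    "Time": pd.to_datetime(["2013-06-01 00:00:00"] * 4 + ["2013-06-01 00:05:00"] * 2),
    "X": [13512.466575, -13529.778408, 25108.907276, -8436.586675,
          13559.288748, -13500.120049],
})
glonass = pd.DataFrame({
    "Time": pd.to_datetime(["2013-06-01 00:00:00",
                            "2013-06-01 00:15:00",
                            "2013-06-01 00:15:00"]),
    "X": [24875.2905273, 14831.4306641, 24234.6674805],
})

# Number the rows within each timestamp (hypothetical helper column "n").
gps["n"] = gps.groupby("Time").cumcount()
glonass["n"] = glonass.groupby("Time").cumcount()

# Keep only timestamps that occur in both frames.
gps_common = gps[gps["Time"].isin(glonass["Time"])]
glonass_common = glonass[glonass["Time"].isin(gps["Time"])]

# Outer merge on (Time, n): unmatched positions become NaN instead of copies.
coord = (gps_common
         .merge(glonass_common, on=["Time", "n"], how="outer",
                suffixes=("_gps", "_glonass"))
         .drop(columns="n"))

print(coord)
```

Unlike the duplicated-value test, this does not blank out coordinates that merely happen to repeat in the data.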
