How to remove outliers specific to each timestamp? - python

I have the below data frame, which is time-series data that I process and feed into my prediction models.
import pandas as pd

df = pd.DataFrame({"timestamp": [pd.Timestamp('2019-01-01 01:00:00', tz=None),
                                 pd.Timestamp('2019-01-01 01:00:00', tz=None),
                                 pd.Timestamp('2019-01-01 01:00:00', tz=None),
                                 pd.Timestamp('2019-01-01 02:00:00', tz=None),
                                 pd.Timestamp('2019-01-01 02:00:00', tz=None),
                                 pd.Timestamp('2019-01-01 02:00:00', tz=None),
                                 pd.Timestamp('2019-01-01 03:00:00', tz=None),
                                 pd.Timestamp('2019-01-01 03:00:00', tz=None),
                                 pd.Timestamp('2019-01-01 03:00:00', tz=None)],
                   "value": [5.4, 5.1, 100.8, 20.12, 21.5, 80.08, 150.09, 160.12, 20.06]})
From this, I take the mean of the value for each timestamp and send that as the input to the predictor. Currently I am just using fixed thresholds to filter out the outliers, but those seem to filter out real values and also miss some outliers.
For example, I kept
df[(df['value'] > 3) & (df['value'] < 120)]
and then this does not filter out
2019-01-01 01:00:00 100.8
which is an outlier for that timestamp and does filter out
2019-01-01 03:00:00 150.09
2019-01-01 03:00:00 160.12
which are not outliers for that timestamp.
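In other words, a rough sketch of the current pipeline as described above (fixed thresholds, then per-timestamp means):
# current approach: global thresholds, then average per timestamp
filtered = df[(df['value'] > 3) & (df['value'] < 120)]
hourly_means = filtered.groupby('timestamp')['value'].mean()
print(hourly_means)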
So how do I filter out, for each timestamp, the values that do not fit that group?
Any help is appreciated.

OK, let's assume you want to use a confidence interval to detect outliers.
Then you have to get the mean and the confidence interval for each timestamp group. For that you can run:
import math

stats = df.groupby(['timestamp'])['value'].agg(['mean', 'count', 'std'])

ci95_hi = []
ci95_lo = []
for i in stats.index:
    m, c, s = stats.loc[i]
    ci95_hi.append(m + 1.96 * s / math.sqrt(c))
    ci95_lo.append(m - 1.96 * s / math.sqrt(c))

stats['ci95_hi'] = ci95_hi
stats['ci95_lo'] = ci95_lo

df = pd.merge(df, stats, how='left', on='timestamp')
This gives each row its group mean and 95% confidence bounds. Then you can add a flag column:
import numpy as np

df['Outlier'] = np.where(df['value'] >= df['ci95_hi'], 1,
                np.where(df['value'] <= df['ci95_lo'], 1, 0))
Then everything with a 1 in the Outlier column is an outlier. You can adjust the 1.96 factor to play with the sensitivity a little.
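To tie this back to the original goal, a minimal follow-up sketch (using the column names above) that drops the flagged rows and takes the per-timestamp means that feed the predictor:
# drop flagged rows, then average per timestamp for the predictor input
clean_means = (df[df['Outlier'] == 0]
               .groupby('timestamp')['value']
               .mean())
print(clean_means)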

Related

How to keep correct date when plotting data in Python?

I am working with a dataset whose records range from 16-02-2022 00:00 to 01/04/2022 11:30. I want to plot the daily and monthly means of 4 variables (H, LE, co2, h2o) and then compare them with the same variables from another dataset. I filtered the variables of interest by their quality flags and removed outliers using the interquartile range. My problem is that I can't get the real dates when I plot the averaged values. For example, I should get about 2 months, but instead I get many more.
As you can see, the plot of the Sensible Heat Flux monthly mean cycle is not correct, because I have more or less two months of data.
I used this script to import the data:
datasetBio = pd.read_csv(io.BytesIO(uploaded["eddypro_Bioesame_full_output_exp2.csv"]),
                         sep=';', header=1, parse_dates=['day&time'],
                         index_col=['day&time'], na_values="-9999")
and my DateTimeIndex look like this:
DatetimeIndex(['2022-02-16 00:00:00', '2022-02-16 00:30:00',
'2022-02-16 01:00:00', '2022-02-16 01:30:00',
'2022-02-16 02:00:00', '2022-02-16 02:30:00',
'2022-02-16 03:00:00', '2022-02-16 03:30:00',
'2022-02-16 04:00:00', '2022-02-16 04:30:00',
...
'2022-01-04 07:00:00', '2022-01-04 07:30:00',
'2022-01-04 08:00:00', '2022-01-04 08:30:00',
'2022-01-04 09:00:00', '2022-01-04 09:30:00',
'2022-01-04 10:00:00', '2022-01-04 10:30:00',
'2022-01-04 11:00:00', '2022-01-04 11:30:00'],
dtype='datetime64[ns]', name='day&time', length=2136, freq=None)
In my dataset there is also a DOY column, but I tried to use it without success. I also tried the datetime module, strftime and strptime, without success.
I also tried with:
# list comprehension
datasetBio['HM'] = ['%s-%s' % (el.hour, el.minute) for el in datasetBio.index]

list_m = []
list_h = []
for el in datasetBio['HM'].unique():
    list_m.append(datasetBio['H'].loc[datasetBio['HM'] == el].mean())
    list_h.append(el)

# Look at the group
for el in datasetBio['HM'].unique():
    print(datasetBio.loc[datasetBio['HM'] == el])
partial output:
2022-02-19 12:30:00 0.0 ... 0.575134 0.424103 0.066102 0.041973
2022-02-20 12:30:00 1.0 ... 0.898857 0.551975 0.069380 0.221436
2022-02-21 12:30:00 0.0 ... 221.180000 234.682000 0.369427 0.161920
2022-02-22 12:30:00 1.0 ... 0.521469 0.673882 0.074374 0.312831
2022-02-23 12:30:00 0.0 ... 0.303948 0.630388 0.069664 0.283314
When I try to plot the variables from the 2 datasets together, the problem with the days obviously remains. Instead of covering the correct time range, the plot runs from January to December.
Please, someone help me solve this problem because I don't know what to do anymore.
Thanks in advance.
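Given the mixed 16-02-2022 and 01/04/2022 date strings, one likely cause is that the parser reads 01/04/2022 as 4 January rather than 1 April; the trailing '2022-01-04 ...' entries in the index above point the same way. A minimal, hedged sketch of the import, assuming the file really uses day-first dates, would pass dayfirst=True:
import io
import pandas as pd

# dayfirst=True makes the parser read 01/04/2022 as 1 April 2022
datasetBio = pd.read_csv(io.BytesIO(uploaded["eddypro_Bioesame_full_output_exp2.csv"]),
                         sep=';', header=1, na_values="-9999",
                         parse_dates=['day&time'], dayfirst=True,
                         index_col=['day&time'])
datasetBio = datasetBio.sort_index()  # keep the index monotonic for daily/monthly resampling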

Aggregating Values using Date Ranges in Another Dataframe

I need to sum values from maindata using the date ranges in master_records. Many values for an id will not get summed, even though there are timestamps and values in those rows.
import pandas as pd

# Proxy reference dataframe
master_records = [['site a', '2021-03-05 02:00:00', '2021-03-05 03:00:00'],
                  ['site a', '2021-03-05 06:00:00', '2021-03-05 08:00:00'],
                  ['site b', '2021-04-08 10:00:00', '2021-04-08 13:00:00']]
mst_df = pd.DataFrame(master_records, columns=['id', 'start', 'end'])
mst_df['start'] = pd.to_datetime(mst_df['start'], infer_datetime_format=True)
mst_df['end'] = pd.to_datetime(mst_df['end'], infer_datetime_format=True)
# Proxy main high-frequency dataframe
main_data = [['id a', '2021-03-05 00:00:00', 10],  # not aggregated
             ['id a', '2021-03-05 01:00:00', 19],  # not aggregated
             ['id a', '2021-03-05 02:00:00', 9],
             ['id a', '2021-03-05 03:00:00', 16],
             ['id a', '2021-03-05 04:00:00', 16],  # not aggregated
             ['id a', '2021-03-05 05:00:00', 11],  # not aggregated
             ['id a', '2021-03-05 06:00:00', 16],
             ['id a', '2021-03-05 07:00:00', 12],
             ['id a', '2021-03-05 08:00:00', 9],
             ['id b', '2021-04-08 10:00:00', 11],
             ['id b', '2021-04-08 11:00:00', 10],
             ['id b', '2021-04-08 12:00:00', 19],
             ['id b', '2021-04-08 13:00:00', 10],
             ['id b', '2021-04-08 14:00:00', 16]]  # not aggregated

# Create the pandas DataFrame
maindata = pd.DataFrame(main_data, columns=['id', 'timestamp', 'value'])
maindata['timestamp'] = pd.to_datetime(maindata['timestamp'], infer_datetime_format=True)
The desired DataFrame looks like:
print(mst_df)
id start end sum(value)
0 site a 2021-03-05 02:00:00 2021-03-05 03:00:00 25
1 site a 2021-03-05 06:00:00 2021-03-05 08:00:00 37
2 site b 2021-04-08 10:00:00 2021-04-08 13:00:00 50
The "id"s don't match; so first we create a column in both DataFrames to get a matching ID; then merge on the matching "id"s; then filter the merged DataFrame on the rows where the timestamps are between "start" and "end". Finally groupby + sum will fetch the desired outcome:
maindata['id_letter'] = maindata['id'].str.split().str[-1]
mst_df['id_letter'] = mst_df['id'].str.split().str[-1]

merged = mst_df.merge(maindata, on='id_letter', suffixes=('', '_'))

out = (merged[merged['timestamp'].between(merged['start'], merged['end'])]
       .groupby(['id', 'start', 'end'], as_index=False)['value'].sum())
Output:
id start end value
0 site a 2021-03-05 02:00:00 2021-03-05 03:00:00 25
1 site a 2021-03-05 06:00:00 2021-03-05 08:00:00 37
2 site b 2021-04-08 10:00:00 2021-04-08 13:00:00 50
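If you want the sum attached back onto mst_df under the sum(value) column name shown in the desired output, a small follow-up sketch (the rename is just illustrative):
# merge the per-range sums back onto the master records
result = mst_df.drop(columns='id_letter').merge(
    out.rename(columns={'value': 'sum(value)'}),
    on=['id', 'start', 'end'], how='left')
print(result)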

pandas Dataframe redesign/reorder issue [closed]

I have a DataFrame like this one ("raw Dataframe" image) and want to get a DataFrame like this manually crafted one ("wanted outcome" image).
I have tried to do it with the following commands:
SD = station_data.set_index('ELEMENT')
SD.reset_index(inplace=True)
SD = SD.groupby(['ELEMENT'])
result = SD['VALUE'].unique()
result.reset_index(inplace=False)
SD = pd.DataFrame(result)
SD = SD.transpose()
The problem with this is that .unique() is wrong, because you can get the same value multiple times and the values are no longer single objects. So I'm searching for a way to get the wanted DataFrame with all values.
The output of the command block is:
VALUE
ELEMENT
HUMIDITY [98.0, 97.0, 96.0, 95.0, 94.0, 92.0, 93.0, 91....
PRECIPITATION_FORM [<NA>]
PRECIPITATION_HEIGHT [0.0, 0.1, 0.2, 0.3, 0.5, 0.4, 0.6, 0.8, 1.5,...
...
Let us use the following stripped-down DataFrame as an example:
import pandas as pd

test = pd.DataFrame({'STATION_ID': [1207, 1207, 1207, 1207, 1207, 1207, 1207, 1207, 1207],
                     'DATE': ['2019-01-01 00:00:00', '2019-01-01 01:00:00', '2019-01-01 02:00:0',
                              '2019-01-01 00:00:00', '2019-01-01 01:00:00', '2019-01-01 02:00:0',
                              '2019-01-01 00:00:00', '2019-01-01 01:00:00', '2019-01-01 02:00:0'],
                     'ELEMENT': ['TEMPERATURE_AIR_200', 'TEMPERATURE_AIR_200', 'TEMPERATURE_AIR_200',
                                 'HUMIDITY', 'HUMIDITY', 'HUMIDITY',
                                 'TEMPERATURE_DEW_POINT_200', 'TEMPERATURE_DEW_POINT_200', 'TEMPERATURE_DEW_POINT_200'],
                     'VALUE': [0.1, 0.1, 0.4, 98.0, 96.0, 98.0, -0.3, 0.1, 0.4]})
You currently have data in "long" format, and you want your data in "wide" format. This can be done in Pandas using either pivot_table() or unstack().
Here's an example of how to do it with unstack().
import pandas as pd

test = pd.DataFrame({'STATION_ID': [1207, 1207, 1207, 1207, 1207, 1207, 1207, 1207, 1207],
                     'DATE': ['2019-01-01 00:00:00', '2019-01-01 01:00:00', '2019-01-01 02:00:0',
                              '2019-01-01 00:00:00', '2019-01-01 01:00:00', '2019-01-01 02:00:0',
                              '2019-01-01 00:00:00', '2019-01-01 01:00:00', '2019-01-01 02:00:0'],
                     'ELEMENT': ['TEMPERATURE_AIR_200', 'TEMPERATURE_AIR_200', 'TEMPERATURE_AIR_200',
                                 'HUMIDITY', 'HUMIDITY', 'HUMIDITY',
                                 'TEMPERATURE_DEW_POINT_200', 'TEMPERATURE_DEW_POINT_200', 'TEMPERATURE_DEW_POINT_200'],
                     'VALUE': [0.1, 0.1, 0.4, 98.0, 96.0, 98.0, -0.3, 0.1, 0.4]})
# Set indexes so pandas knows which columns to unstack by
test = test.set_index(['STATION_ID', 'DATE', 'ELEMENT'])
test = test.unstack()
# The column names are now a multiindex. Fix that
test.columns = test.columns.get_level_values(1)
# Put the index back how it was. Optional.
test = test.reset_index()
print(test)
Output:
ELEMENT STATION_ID DATE HUMIDITY TEMPERATURE_AIR_200 TEMPERATURE_DEW_POINT_200
0 1207 2019-01-01 00:00:00 98.0 0.1 -0.3
1 1207 2019-01-01 01:00:00 96.0 0.1 0.1
2 1207 2019-01-01 02:00:0 98.0 0.4 0.4
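For comparison, pivot_table(), the other option mentioned above, does the same reshaping in one call; a minimal sketch, starting again from the long-format test frame as originally constructed:
# pivot long to wide; aggfunc='first' keeps each reading as-is since every
# (STATION_ID, DATE, ELEMENT) combination occurs only once in this example
wide = (test.pivot_table(index=['STATION_ID', 'DATE'], columns='ELEMENT',
                         values='VALUE', aggfunc='first')
            .reset_index())
print(wide)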

Python Pandas: detecting frequency of time series

Assume I have loaded time series data from SQL or CSV (not created in Python); the index would be:
DatetimeIndex(['2015-03-02 00:00:00', '2015-03-02 01:00:00',
'2015-03-02 02:00:00', '2015-03-02 03:00:00',
'2015-03-02 04:00:00', '2015-03-02 05:00:00',
'2015-03-02 06:00:00', '2015-03-02 07:00:00',
'2015-03-02 08:00:00', '2015-03-02 09:00:00',
...
'2015-07-19 14:00:00', '2015-07-19 15:00:00',
'2015-07-19 16:00:00', '2015-07-19 17:00:00',
'2015-07-19 18:00:00', '2015-07-19 19:00:00',
'2015-07-19 20:00:00', '2015-07-19 21:00:00',
'2015-07-19 22:00:00', '2015-07-19 23:00:00'],
dtype='datetime64[ns]', name=u'hour', length=3360, freq=None, tz=None)
As you can see, the freq is None. I am wondering how I can detect the frequency of this series and set it as the index's freq. If possible, I would like this to work even for data that isn't continuous (there are plenty of breaks in the series).
I was trying to find the mode of all the differences between consecutive timestamps, but I am not sure how to convert it into a format that the Series can use.
It is worth mentioning that if the data is continuous, you can use the pandas.DatetimeIndex.inferred_freq property:
dt_ix = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_ix._set_freq(None)
dt_ix.inferred_freq
Out[2]: 'H'
or pandas.infer_freq method:
pd.infer_freq(dt_ix)
Out[3]: 'H'
If the data is not continuous, pandas.infer_freq will return None. Similar to what has already been proposed, another alternative is the pandas.Series.diff method:
split_ix = dt_ix.drop(pd.date_range('2015-05-01 00:00:00','2015-05-30 00:00:00', freq='1H'))
split_ix.to_series().diff().min()
Out[4]: Timedelta('0 days 01:00:00')
Maybe try taking the differences of the time index and using the mode (or the smallest difference) as the freq.
import pandas as pd
import numpy as np
# simulate some data
# ===================================
np.random.seed(0)
dt_rng = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_idx = pd.DatetimeIndex(np.random.choice(dt_rng, size=2000, replace=False))
df = pd.DataFrame(np.random.randn(2000), index=dt_idx, columns=['col']).sort_index()
df
col
2015-03-02 01:00:00 2.0261
2015-03-02 04:00:00 1.3325
2015-03-02 05:00:00 -0.9867
2015-03-02 06:00:00 -0.0671
2015-03-02 08:00:00 -1.1131
2015-03-02 09:00:00 0.0494
2015-03-02 10:00:00 -0.8130
2015-03-02 11:00:00 1.8453
... ...
2015-07-19 13:00:00 -0.4228
2015-07-19 14:00:00 1.1962
2015-07-19 15:00:00 1.1430
2015-07-19 16:00:00 -1.0080
2015-07-19 18:00:00 0.4009
2015-07-19 19:00:00 -1.8434
2015-07-19 20:00:00 0.5049
2015-07-19 23:00:00 -0.5349
[2000 rows x 1 columns]
# processing
# ==================================
# the gap distribution
res = (pd.Series(df.index[1:]) - pd.Series(df.index[:-1])).value_counts()
01:00:00 1181
02:00:00 499
03:00:00 180
04:00:00 93
05:00:00 24
06:00:00 10
07:00:00 9
08:00:00 3
dtype: int64
# the mode can be considered as frequency
res.index[0] # output: Timedelta('0 days 01:00:00')
# or maybe the smallest difference
res.index.min() # output: Timedelta('0 days 01:00:00')
# get full datetime rng
full_rng = pd.date_range(df.index[0], df.index[-1], freq=res.index[0])
full_rng
DatetimeIndex(['2015-03-02 01:00:00', '2015-03-02 02:00:00',
'2015-03-02 03:00:00', '2015-03-02 04:00:00',
'2015-03-02 05:00:00', '2015-03-02 06:00:00',
'2015-03-02 07:00:00', '2015-03-02 08:00:00',
'2015-03-02 09:00:00', '2015-03-02 10:00:00',
...
'2015-07-19 14:00:00', '2015-07-19 15:00:00',
'2015-07-19 16:00:00', '2015-07-19 17:00:00',
'2015-07-19 18:00:00', '2015-07-19 19:00:00',
'2015-07-19 20:00:00', '2015-07-19 21:00:00',
'2015-07-19 22:00:00', '2015-07-19 23:00:00'],
dtype='datetime64[ns]', length=3359, freq='H', tz=None)
The minimum time difference is found with
np.diff(df.index.values).min()
which is normally in units of ns. To get a frequency in Hz, assuming the differences are in ns:
freq = 1e9 / np.diff(df.index.values).min().astype(int)
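To actually set the detected frequency on the data, the second part of the question, a minimal sketch reusing df and the gap distribution res from the example above; asfreq puts the data on a regular grid and makes gaps explicit as NaN rows:
import pandas as pd

detected = res.index[0]                               # Timedelta('0 days 01:00:00') from above
offset = pd.tseries.frequencies.to_offset(detected)   # convert the Timedelta to an offset (<Hour>)
df_regular = df.asfreq(offset)                        # regular hourly grid; missing hours become NaN
print(df_regular.index.freq)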

numpy.datetime64: how to get weekday of numpy datetime64 and check if it's between time1 and time2

How can I check whether a numpy datetime is between time1 and time2 (ignoring the date)?
Say I have a series of datetimes; I want to check the weekday of each, and whether it's between 13:00 and 13:30. For example,
2014-03-05 22:55:00
is a Wednesday and is not between 13:00 and 13:30.
Using pandas, you could use the DatetimeIndex.indexer_between_time method to find those dates whose time is between 13:00 and 13:30.
For example,
import pandas as pd
dates = pd.date_range('2014-3-1 00:00:00', '2014-3-8 0:00:00', freq='50T')
dates_between = dates[dates.indexer_between_time('13:00','13:30')]
wednesdays_between = dates_between[dates_between.weekday == 2]
These are the first 5 items in dates:
In [95]: dates.tolist()[:5]
Out[95]:
[Timestamp('2014-03-01 00:00:00', tz=None),
Timestamp('2014-03-01 00:50:00', tz=None),
Timestamp('2014-03-01 01:40:00', tz=None),
Timestamp('2014-03-01 02:30:00', tz=None),
Timestamp('2014-03-01 03:20:00', tz=None)]
Notice that these dates are all between 13:00 and 13:30:
In [96]: dates_between.tolist()[:5]
Out[96]:
[Timestamp('2014-03-01 13:20:00', tz=None),
Timestamp('2014-03-02 13:30:00', tz=None),
Timestamp('2014-03-04 13:00:00', tz=None),
Timestamp('2014-03-05 13:10:00', tz=None),
Timestamp('2014-03-06 13:20:00', tz=None)]
And of those dates, here is the only one that is a Wednesday:
In [99]: wednesdays_between.tolist()
Out[99]: [Timestamp('2014-03-05 13:10:00', tz=None)]
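For a single numpy.datetime64 value like the one in the question, a minimal sketch (converting through pd.Timestamp, which exposes weekday() and time()) could be:
import datetime
import numpy as np
import pandas as pd

value = np.datetime64('2014-03-05 22:55:00')
ts = pd.Timestamp(value)

is_wednesday = ts.weekday() == 2                                  # Monday is 0, so Wednesday is 2
in_window = datetime.time(13, 0) <= ts.time() <= datetime.time(13, 30)
print(is_wednesday, in_window)                                    # True False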
