How to detect anomalies in multivariate, multiple time-series data? - python

I am trying to find anomalies in a large sales-transactions dataset (more than one million observations) with thousands of unique customers. The same customer can purchase multiple times on the same date, and the dataset contains a mix of both random and seasonal transactions. A dummy sample of my data is below:
Date      CustomerID  TransactionType  CompanyAccountNum  Amount
01.01.19  1           Sales            111xxx               100
01.01.19  1           Credit           111xxx             -3100
01.01.19  4           Sales            111xxx               100
02.01.19  3           Sales            311xxx               100
02.01.19  1           Refund           211xxx             -2100
03.01.19  4           Sales            211xxx              3100
Which algorithm/approach would suit this problem best? So far I have tried a multivariate FBProphet model (in Python) and received less-than-satisfactory results.

You may try the pyod package, with methods like Isolation Forest or HBOS.
It is advertised as 'a comprehensive and scalable Python toolkit for detecting outlying objects in multivariate data', but your mileage may vary in terms of performance, so check out their benchmarks first.
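As a minimal sketch of the Isolation Forest idea (pyod's `IForest` wraps the same scikit-learn estimator used here; the data are made up for illustration):

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Hypothetical per-transaction features, e.g. amount plus a
# customer-level aggregate; 500 normal rows around (100, 100)
X = rng.normal(100, 20, size=(500, 2))
X[:5] = [[5000, 5000]] * 5  # inject a few obvious outliers

# contamination = expected fraction of anomalies (an assumption you tune)
clf = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = clf.predict(X)  # -1 = anomaly, 1 = normal
```

On real transaction data you would first engineer numeric features per row (amount, transaction-type encoding, per-customer statistics) before fitting.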

If you have time-series data, it is better to first apply a method such as a moving average or exponential smoothing to remove trends and seasonality from your data. Otherwise, all data points that fall within trend or seasonal periods will be labeled as anomalies.
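One simple way to sketch this detrending step (the series, the weekly pattern, and the MAD threshold here are all illustrative assumptions):

```python
import numpy as np
import pandas as pd

# Hypothetical daily totals with a weekly seasonal pattern plus one spike
idx = pd.date_range('2019-01-01', periods=90, freq='D')
y = pd.Series(100 + 20 * np.sin(2 * np.pi * np.arange(90) / 7), index=idx)
y.iloc[45] += 150  # injected anomaly

# Subtract a centered moving average to remove the local level/trend,
# then flag residuals far from zero using a robust MAD-based cutoff
resid = y - y.rolling(7, center=True).mean()
mad = (resid - resid.median()).abs().median()
anomalies = resid.abs() > 3 * 1.4826 * mad
```

Without the rolling-mean step, the seasonal peaks themselves would dominate the residuals and be flagged, which is exactly the failure mode described above.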

Is the runtime of df.groupby().transform() linear in the number of groups in the groupby object?

BACKGROUND
I am calculating racial segregation statistics between and within firms using the Theil Index. The data structure is a multi-indexed pandas dataframe. The calculation involves a lot of df.groupby()['foo'].transform(), where the transformation is the entropy function from scipy.stats. I have to calculate entropy on smaller and smaller groups within this structure, which means calling entropy more and more times on the groupby objects. I get the impression that this is O(n), but I wonder whether there is an optimization that I am missing.
EXAMPLE
The key part of this dataframe comprises five variables: county, firm, race, occ, and size. The units of observation are counts: each row tells you the SIZE of the workforce of a given RACE in a given OCCupation in a FIRM in a specific COUNTY. Hence the multiindex:
df = df.set_index(['county', 'firm', 'occ', 'race']).sort_index()
The Theil Index is the size-weighted sum of sub-units' entropy deviations from the unit's entropy. To calculate segregation between counties, for example, you can do this:
from scipy.stats import entropy
from numpy import where
# Helper to calculate the actual components of the Theil statistic
def Hcmp(w_j, w, e_j, e):
    return where(e == 0, 0, (w_j / w) * ((e - e_j) / e))
df['size_county'] = df.groupby(['county', 'race'])['size'].transform('sum')
df['size_total'] = df['size'].sum()
# Create a dataframe with observations aggregated over county/race tuples
counties = df.groupby(['county', 'race'])[['size_county', 'size_total']].first()
counties['entropy_county'] = counties.groupby('county')['size_county'].transform(entropy, base=4) # <--
# The base for entropy is 4 because there are four recorded racial categories.
# Assume that counties['entropy_total'] has already been calculated.
counties['seg_cmpnt'] = Hcmp(counties['size_county'], counties['size_total'],
counties['entropy_county'], counties['entropy_total'])
county_segregation = counties['seg_cmpnt'].sum()
Focus on this line:
counties['entropy_county'] = counties.groupby('county')['size_county'].transform(entropy, base=4)
The starting dataframe has 3,130,416 rows. When grouped by county, though, the resulting groupby object has just 2,267 groups. This runs quickly enough. When I calculate segregation within counties and between firms, the corresponding line is this:
firms['entropy_firm'] = firms.groupby('firm')['size_firm'].transform(entropy, base=4)
Here, the groupby object has 86,956 groups (the count of firms in the data). This takes about 40 times as long as the prior, which looks suspiciously like O(n). And when I try to calculate segregation within firms, between occupations...
# Grouping by firm and occupation because occupations are not nested within firms
occs['entropy_occ'] = occs.groupby(['firm', 'occ'])['size_occ'].transform(entropy, base=4)
...There are 782,604 groups. Eagle-eyed viewers will notice that this is exactly 1/4th the size of the raw dataset, because I have one observation for each firm/race/occupation tuple, and four racial categories. It is also nine times the number of groups in the by-firm groupby object, because the data break employment out into nine occupational categories.
This calculation takes about nine times as long: four or five minutes. When the underlying research project involves 40-50 years of data, this part of the process can take three or four hours.
THE PROBLEM, RESTATED
I think the issue is that, even though scipy.stats.entropy() is being applied in a smart, vectorized way, the necessity of calculating it over a very large number of small groups--and thus calling it many, many times--is swamping the performance benefits of vectorized calculations.
I could pre-calculate the necessary logarithms that entropy requires, for example with numpy.log(). If I did that, though, I'd still have to group the data to first get each firm/occupation/race's share within the firm/occupation. I would also lose any advantage of readable code that looks similar at different levels of analysis.
Thus my question, stated as clearly as I can: is there a more computationally efficient way to call something like this entropy function, when calculating it over a very large number of relatively small groups in a large dataset?
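The pre-computation idea mentioned above can be sketched on a toy frame: instead of one `entropy()` call per group, compute each row's share with a single `transform('sum')` and then sum `-p*log(p)` per group, which keeps everything vectorized. The firms and sizes below are made up:

```python
import numpy as np
import pandas as pd
from scipy.stats import entropy

# Toy frame: two firms, two racial categories each (hypothetical sizes)
df = pd.DataFrame({
    'firm': ['a', 'a', 'b', 'b'],
    'race': ['w', 'b', 'w', 'b'],
    'size': [30.0, 10.0, 5.0, 15.0],
})

# Slow route: one scipy entropy() call per group
slow = df.groupby('firm')['size'].transform(entropy, base=4)

# Vectorized route: shares via one transform('sum'), then
# entropy = -sum(p * log(p)) / log(base) via a second grouped sum.
# Note: zero-size rows would need masking, since log(0) is undefined.
p = df['size'] / df.groupby('firm')['size'].transform('sum')
plogp = -p * np.log(p) / np.log(4)
fast = plogp.groupby(df['firm']).transform('sum')
```

The two routes agree numerically, but the second replaces per-group Python-level calls with two grouped reductions, which is typically much faster when the number of groups is large.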

Time Series Forecasting in python

I have a dataset that contains 300 rows and 4 columns: Date, Hour, Counts (how many ads were shown on TV during this hour), and Visits (how many visits were made to the website during this hour). Here is an example of the data:
If I want to test the effect of the TV spots on visits to the website, should I treat it as a time series and use regression, for example? And what should the input table look like in that case? I know that I have to split the date into day and month, but how should I treat the Counts column? Should I leave it as it is, if my y is to be the number of visits?
Thanks
Just to avoid the case of a single-input, single-output regression model, you could use Hour and Counts as inputs and predict Visits.
I don't know what format the hours are in; if they are in 12-hour format, convert them to 24-hour format before feeding them to your model.
If you want to predict the next dates and hours in the time series, regression models or classical time-series models such as ARIMA, ARMA, or exponential smoothing would be useful.
But since you need to predict the effectiveness of the TV spots, I recommend generating features from Counts using the tsfresh library in Python to remove the time effect, and then using a machine-learning model such as SVR or Gradient Boosting for the prediction.
In your problem:
from tsfresh import extract_features

extracted_features = extract_features(df,
                                      column_id="Hour",
                                      column_kind=None,
                                      column_value="Counts",
                                      column_sort="time")
So, your target table will be:
Hour  Feature_1    Feature_2    ...  Visits(Avg)
0     min(Counts)  max(Counts)  ...  mean(Visits)
1     min(Counts)  max(Counts)  ...  mean(Visits)
2     min(Counts)  max(Counts)  ...  mean(Visits)
min() and max() are just example features; tsfresh can extract many others. See the tsfresh documentation for more information.
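A target table with the shape sketched above can also be built directly with plain pandas, if you only want a few hand-picked features rather than the full tsfresh set (the rows here are hypothetical):

```python
import pandas as pd

# Hypothetical raw data: one row per (date, hour) observation
df = pd.DataFrame({
    'Hour':   [0, 0, 1, 1],
    'Counts': [2, 6, 1, 3],
    'Visits': [10, 30, 5, 15],
})

# One row per hour: summary features of Counts, average Visits as y
target = df.groupby('Hour').agg(
    min_counts=('Counts', 'min'),
    max_counts=('Counts', 'max'),
    mean_visits=('Visits', 'mean'),
).reset_index()
```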

How to compare daily active data to cumulative data?

So I am working on COVID-19 data for the state of Texas, USA.
I have been given 2 hypotheses to work on:
A higher hospitalization rate gives a higher fatality rate
A higher ICU rate gives a higher fatality rate.
Fatality Data - https://dshs.texas.gov/coronavirus/TexasCOVID19DailyCountyFatalityCountData.xlsx
Hospitalization / ICU Data - https://dshs.texas.gov/coronavirus/CombinedHospitalDataoverTimebyTSA.xlsx
So the basic approach to testing these hypotheses should be to compare cumulative/per-day fatality data against cumulative/per-day hospitalization/ICU data.
The main issue is that the fatality data are given as a cumulative sum, while the hospitalization/ICU data are active numbers per day. Is there any way these two can be compared, and if so, how? Or is there anything else we can do about it?
Cumulative data is the cumulative sum (cumsum) of per-day data, and conversely, per-day data is the first difference of cumulative data.
I assume the number of cumulative fatalities is accumulated per day, so you can extract the per-day number of fatalities by differencing (e.g. with np.diff). That way, every series will be a daily number. Note that differencing leaves you with one fewer data point.
You can also go the other way and accumulate the hospitalization or ICU numbers with cumsum, to compare them with the cumulative number of fatalities.
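The diff/cumsum round trip looks like this (the cumulative counts below are made up):

```python
import numpy as np

cumulative = np.array([1, 3, 6, 10, 10, 15])  # hypothetical cumulative fatalities
daily = np.diff(cumulative)                   # per-day counts, one element shorter

# Prepend the first cumulative value so day 1 is not lost;
# the result sums back to the final cumulative total
daily_full = np.concatenate(([cumulative[0]], daily))
```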

Frequency in pandas timeseries index and statsmodel

I have a pandas timeseries y that does not work well with statsmodel functions.
import statsmodels.api as sm
y.tail(10)
2019-09-20 7.854
2019-10-01 44.559
2019-10-10 46.910
2019-10-20 49.053
2019-11-01 24.881
2019-11-10 52.882
2019-11-20 84.779
2019-12-01 56.215
2019-12-10 23.347
2019-12-20 31.051
Name: mean_rainfall, dtype: float64
I verify that it is indeed a timeseries
type(y)
pandas.core.series.Series
type(y.index)
pandas.core.indexes.datetimes.DatetimeIndex
From here, I am able to pass the timeseries through an autocorrelation function with no problem, which produces the expected output
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(y, lags=72, alpha=0.05)
However, when I try to pass this exact same object y to SARIMA
mod = sm.tsa.statespace.SARIMAX(y, order=pdq, seasonal_order=seasonal_pdq)
results = mod.fit()
I get the following warning:
A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
The problem is that the frequency of my timeseries is not regular (it is the 1st, 10th, and 20th of every month), so I cannot set freq='M' or freq='D', for example. What is the workaround in this case?
I am new to timeseries; any advice on how to keep my index from being ignored during forecasting would help, as this currently prevents any predictions.
First of all, it is extremely important to understand the relationship between the datetime column and the target column (rainfall). Looking at the snippet you provide, I can think of two possibilities:
y represents the rainfall that occurred in the date range between the current row's date and the next row's date. In that case, the timeseries is an aggregated rainfall series with unequal date buckets, i.e. 1-10, 10-20, 20-(end-of-month), and you have two options:
You can disaggregate your data, using either equal weighting or, better, interpolation, to create a continuous and relatively smooth timeseries. You can then fit your model on the daily timeseries and generate predictions, which will naturally also be daily. These you can aggregate back to the 1-10, 10-20, 20-(end-of-month) buckets to get your predictions. One way to do the resampling is with the code below.
ts.Date = pd.to_datetime(ts.Date, format='%d/%m/%y')
ts['delta_time'] = (ts['Date'].shift(-1) - ts['Date']).dt.days
ts['delta_rain'] = ts['Rain'].shift(-1) - ts['Rain']
ts['timesteps'] = ts['Date']
ts['grad_rain'] = ts['delta_rain'] / ts['delta_time']
ts.set_index('timesteps', inplace=True)
ts = ts.resample('d').ffill()
ts['daily_rain'] = ts['Rain'] + ts['grad_rain'] * (ts.index - ts['Date']).dt.days
ts['daily_rain'] = ts['daily_rain'] / ts['delta_time']
print(ts.head(50))
daily_rain is now the target column and the index i.e. timesteps is the timestamp.
The other option is to approximate: the date ranges 1-10, 10-20, 20-(EOM) are each roughly 10 days, so you can treat them as equal timesteps. Of course statsmodels won't allow that directly, so you would need to reset the index to mock dates for which you maintain a mapping. Below is what you would use in statsmodels as y, but do maintain a mapping back to your original dates. Freq will be 'd' (daily), and you would need to rescale the seasonality as well so that it follows the new date scale.
y.tail(10)
2019-09-01 7.854
2019-09-02 44.559
2019-09-03 46.910
2019-09-04 49.053
2019-09-05 24.881
2019-09-06 52.882
2019-09-07 84.779
2019-09-08 56.215
2019-09-09 23.347
2019-09-10 31.051
Name: mean_rainfall, dtype: float64
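A mock daily index like the one above can be built with pd.date_range while keeping a mapping back to the real dates; the dates and values here are illustrative:

```python
import pandas as pd

# True, irregular observation dates (1st/10th/20th pattern)
true_dates = pd.to_datetime(['2019-12-01', '2019-12-10', '2019-12-20'])
values = [56.215, 23.347, 31.051]

# Pretend the steps are equidistant: re-index onto a daily mock calendar
mock_index = pd.date_range('2019-09-08', periods=len(values), freq='D')
y = pd.Series(values, index=mock_index, name='mean_rainfall')

# Keep a mapping so forecasts can be translated back to real dates
mapping = dict(zip(mock_index, true_dates))
```

Because the mock index has an explicit daily frequency, statsmodels no longer ignores it; only remember to translate forecast dates back through the mapping.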
I would recommend the first option, though, as it is simply more accurate. It also lets you try out other aggregation levels, both during model training and for your predictions. More control!
The second scenario is that the data represent measurements for the date itself only, not for a range. That would mean you technically do not have enough information to construct an accurate timeseries: your timesteps are not equidistant, and you don't know what happened between them. You can still improvise and get some approximations going. The second approach listed above would still work as is. For the first approach you would need to interpolate, but given that the target variable is rainfall, which has a lot of variation, I would strongly discourage this!
As far as I can see, the package uses the frequency as a premise for everything, since it is a time-series problem.
So you will not be able to use it with irregularly spaced data. In fact, you will have to make an assumption in your analysis to adapt your data for use. Some options are:
1) Run 3 separate analyses (1st days, 10th days, and 20th days individually) and use a 30-day frequency.
2) As your points are roughly 10 days apart, you could interpolate and then resample to a frequency of 1 day. Of course, this option only makes sense depending on the nature of your problem and how quickly your data change.
Either way, I would just like to point out that how you model your problem and your data is key when dealing with time series, and with data science in general. In my experience as a data scientist, it is by analyzing the domain (where your data came from) that you get a feeling for which approach will work better.

Predicting churn of customer using Machine Learning with lag

I have data of 5000 customers over time series (monthly) which looks like:
This is my first time dealing with time-series data. Can someone explain some strategies for predicting churn probability (3 months, 6 months) in advance?
I am confused because, for every customer, the churn label 3 or 6 months in advance will be zero (according to the target). So should I look at trends or create lag variables?
And if I use regression, I still don't know what the target variable would be.
You can use a lag function, then define churn and use a predictive model to predict it. Since I can see the purchased product category, you can take lags over the last 3 months for each unique customer ID and then define churn over that 3-month horizon.
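A minimal sketch of building lagged features and a 3-months-ahead churn target with pandas (the panel below is made up; column names are assumptions):

```python
import pandas as pd

# Hypothetical monthly panel: one row per customer per month
df = pd.DataFrame({
    'customer_id': [1, 1, 1, 1, 2, 2, 2, 2],
    'month':       [1, 2, 3, 4, 1, 2, 3, 4],
    'purchases':   [5, 3, 0, 0, 2, 2, 1, 0],
    'churned':     [0, 0, 0, 1, 0, 0, 0, 0],
})
df = df.sort_values(['customer_id', 'month'])

# Lag features: activity 1-3 months back, computed per customer
for lag in (1, 2, 3):
    df[f'purchases_lag{lag}'] = df.groupby('customer_id')['purchases'].shift(lag)

# Target for "will churn in 3 months": shift the churn flag backwards,
# so each row is labeled with what happens 3 months later
df['churn_in_3m'] = df.groupby('customer_id')['churned'].shift(-3)
```

The model (logistic regression, gradient boosting, etc.) then trains on the lag columns against `churn_in_3m`; rows whose lags or target fall outside the observed window come out as NaN and are dropped.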