I am working on COVID-19 data for the state of Texas, USA.
I have been given two hypotheses to work on:
A higher hospitalization rate gives a higher fatality rate.
A higher ICU rate gives a higher fatality rate.
Fatality Data - https://dshs.texas.gov/coronavirus/TexasCOVID19DailyCountyFatalityCountData.xlsx
Hospitalization / ICU Data - https://dshs.texas.gov/coronavirus/CombinedHospitalDataoverTimebyTSA.xlsx
So the basic approach to testing these hypotheses should be to compare cumulative/per-day fatality data against cumulative/per-day hospitalization/ICU data.
The main issue is that the fatality data is given as a cumulative sum, while the hospitalization/ICU data is an active count per day. Is there any way these two can be compared, and if so, how? Or is there anything else we can do about it?
Cumulative data is the cumulative sum (cumsum) of per-day data, and conversely, per-day data is the difference (diff) of cumulative data.
I assume the cumulative number of fatalities is accumulated per day, so you can extract the per-day number of fatalities by differencing (e.g. with np.diff). That way, every series will be a daily number. Note that in this case you will end up with one fewer data point.
You can also go the other way and accumulate the hospitalization or ICU numbers with cumsum, to be compared with the cumulative number of fatalities.
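A minimal sketch of both directions with pandas; the column names cumulative_fatalities and daily_icu, and the example values, are assumptions rather than the actual headers in the DSHS spreadsheets.

import pandas as pd

# Hypothetical daily frames indexed by date; column names and values are
# illustrative, not the real DSHS spreadsheet contents.
fatalities = pd.DataFrame({"cumulative_fatalities": [0, 2, 5, 9, 15]},
                          index=pd.date_range("2020-07-01", periods=5))
icu = pd.DataFrame({"daily_icu": [10, 12, 15, 14, 18]},
                   index=pd.date_range("2020-07-01", periods=5))

# Direction 1: difference the cumulative fatalities to get per-day counts
# (the first day has no previous value, hence the dropna).
daily_fatalities = fatalities["cumulative_fatalities"].diff().dropna()

# Direction 2: accumulate the daily ICU counts so both series are cumulative.
cumulative_icu = icu["daily_icu"].cumsum()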
I have a pandas time series y that does not work well with statsmodels functions.
import statsmodels.api as sm
y.tail(10)
2019-09-20 7.854
2019-10-01 44.559
2019-10-10 46.910
2019-10-20 49.053
2019-11-01 24.881
2019-11-10 52.882
2019-11-20 84.779
2019-12-01 56.215
2019-12-10 23.347
2019-12-20 31.051
Name: mean_rainfall, dtype: float64
I verify that it is indeed a timeseries
type(y)
pandas.core.series.Series
type(y.index)
pandas.core.indexes.datetimes.DatetimeIndex
From here, I am able to pass the timeseries through an autocorrelation function with no problem, which produces the expected output
from statsmodels.graphics.tsaplots import plot_acf
plot_acf(y, lags=72, alpha=0.05)
However, when I try to pass this exact same object y to SARIMA
mod = sm.tsa.statespace.SARIMAX(y, order=pdq, seasonal_order=seasonal_pdq)
results = mod.fit()
I get the following error:
A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
The problem is that the frequency of my time series is not regular (it is the 1st, 10th, and 20th of every month), so I cannot set freq='m' or freq='D', for example. What is the workaround in this case?
I am new to using time series; any advice on how to keep my index from being ignored during forecasting would help, since this prevents any predictions from being possible.
First of all, it is extremely important to understand what the relationship between the datetime column and the target column (rainfall) is. Looking at the snippet you provide, I can think of two possibilities:
y represents the rainfall that occurred in the date range between the current row's date and the next row's date. In that case, the time series is essentially an aggregated rainfall series with unequal date buckets, i.e. 1-10, 10-20, 20-(end-of-month), and you have two options:
You can disaggregate your data using either equal weighting or, even better, interpolation, to create a continuous and relatively smooth time series. You can then fit your model on the daily time series and generate predictions, which will naturally be daily as well. These you can aggregate back to the 1-10, 10-20, 20-(end-of-month) buckets to get your predictions. One way to do the resampling is with the code below.
import pandas as pd

# ts is assumed to be a DataFrame with a 'Date' column and a 'Rain' column
# holding the aggregated rainfall for each (unequal) bucket.
ts['Date'] = pd.to_datetime(ts['Date'], format='%d/%m/%y')
ts['delta_time'] = (ts['Date'].shift(-1) - ts['Date']).dt.days  # bucket length in days
ts['delta_rain'] = ts['Rain'].shift(-1) - ts['Rain']            # change in rainfall to the next bucket
ts['timesteps'] = ts['Date']
ts['grad_rain'] = ts['delta_rain'] / ts['delta_time']           # per-day gradient within the bucket
ts.set_index('timesteps', inplace=True)
ts = ts.resample('d').ffill()                                   # expand to a daily index, forward-filling bucket values
# Interpolate linearly within each bucket, then spread the value over the bucket length.
ts['daily_rain'] = ts['Rain'] + ts['grad_rain'] * (ts.index - ts['Date']).dt.days
ts['daily_rain'] = ts['daily_rain'] / ts['delta_time']
print(ts.head(50))
daily_rain is now the target column, and the index (i.e. timesteps) is the timestamp.
The other option is to approximate the date ranges 1-10, 10-20, 20-(EOM) as roughly 10 days each, so that the timesteps are effectively equal. Of course statsmodels won't accept that directly, so you would need to reset the index to a mock datetime index and maintain a mapping back to your original dates. Below is what you would use in statsmodels as y (a sketch of building such a mock index follows the sample output). The frequency will be 'd' (daily), and you would need to rescale the seasonality as well so that it follows the new date scale.
y.tail(10)
2019-09-01 7.854
2019-09-02 44.559
2019-09-03 46.910
2019-09-04 49.053
2019-09-05 24.881
2019-09-06 52.882
2019-09-07 84.779
2019-09-08 56.215
2019-09-09 23.347
2019-09-10 31.051
Name: mean_rainfall, dtype: float64
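A minimal sketch of the mock-index idea, assuming y is the irregular series from the question; the start date and variable names are illustrative only.

import pandas as pd

# Keep a mapping from the mock daily positions back to the original dates.
original_index = y.index
y_mock = y.copy()
y_mock.index = pd.date_range(start='2019-09-01', periods=len(y), freq='D')
index_mapping = dict(zip(y_mock.index, original_index))

# y_mock now has a regular daily frequency and can be passed to SARIMAX;
# translate any forecast dates back with index_mapping afterwards.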
I would recommend the first option, though, as it is simply more accurate in nature. It also lets you try out other aggregation levels during model training as well as for your predictions. More control!
The second scenario is that the data represents measurements only for the date itself and not for the range. That would mean you technically do not have enough information to construct an accurate time series: your timesteps are not equidistant and you don't know what happened between them. However, you can still improvise and get some approximations going. The second approach listed above would still work as is. For the first approach you would need to interpolate, but given that the target variable is rainfall, which has a lot of variation, I would highly discourage this.
As far as I can see, the package relies on a fixed frequency as a premise for everything, since it is a time-series problem.
So you will not be able to use it with irregularly spaced data as-is. In fact, you will have to make an assumption in your analysis to make your data fit the tool. Some options are:
1) Run three separate analyses (the 1st days, 10th days, and 20th days individually) and use a 30-day frequency.
2) As your observations are roughly 10 days apart and equally separated, you can consider some kind of interpolation and then resample to a frequency of 1 day, as sketched below. Of course, this option only makes sense depending on the nature of your problem and how quickly your data change.
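A minimal sketch of option 2 with pandas, assuming y is the irregular series from the question; upsampling to daily and interpolating linearly is just one possible choice.

# Upsample the irregular series to a daily index and interpolate the gaps.
y_daily = y.resample('D').interpolate(method='linear')
# y_daily now has freq='D' and can be fed to SARIMAX directly.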
Either way, I would just like to point out that how you model your problem and your data is a key decision when dealing with time series, and with data science in general. In my experience as a data scientist, it is by analyzing the domain (where your data came from) that you get a feeling for which approach will work better.
I downloaded some stock data from CRSP and need the variance of each company's stock returns over the last 36 months.
So, basically the variance based on two conditions:
Same PERMCO (company number)
Monthly stock returns of the last 3 years.
However, I excluded penny stocks from my sample (stocks with prices < $2). Hence, some months are missing and, for example, April's and June's monthly returns end up directly on top of each other.
If I am not mistaken, a rolling function (grouped by PERMCO) would just take the previous 36 monthly returns. But when months are missing, the rolling window would actually cover more than 3 years of data (since the last 36 monthly returns would exceed that timeframe).
Usually I work with MS Excel. However, in this case the amount of data is too big and Excel takes ages to calculate anything. That's why I want to tackle the problem with Python.
The sample is organized as follows:
PERMNO date SHRCD PERMCO PRC RET
When I have figured out how to make a proper table in here I will show you a sample of my data.
What I have tried so far:
data["RET"]=data["RET"].replace(["C","B"], np.nan)
data["date"] = pd.to_datetime(date["date"])
data=data.sort_values[("PERMCO" , "date"]).reset_index()
L3Yvariance=data.groupby("PERMCO")["RET"].rolling(36).var().reset_index()
Sometimes there are 'C' and 'B' entries instead of actual returns, which is why I replace them in the first line.
You can replace the missing values with the mean value. It won't affect the variance, since the variance is calculated after subtracting the mean; for the months where you fill in the mean, the contribution to the variance will be 0.
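A minimal sketch of that idea, assuming data has the PERMCO, date and RET columns from the question; reindexing each PERMCO to a complete monthly grid and the exact fill choice are assumptions on top of the suggestion above.

import pandas as pd

def monthly_rolling_var(group):
    # Ensure returns are numeric (RET may be read as strings).
    group["RET"] = pd.to_numeric(group["RET"], errors="coerce")
    # One row per calendar month so the 36-period window spans 36 real months.
    group = group.set_index("date").resample("M").last()
    # Fill the missing months with the company's mean return, per the suggestion above.
    group["RET"] = group["RET"].fillna(group["RET"].mean())
    return group["RET"].rolling(36).var()

L3Yvariance = data.groupby("PERMCO").apply(monthly_rolling_var).reset_index()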
I am trying to find anomalies in a huge sales-transactions dataset (more than 1 million observations) with thousands of unique customers. The same customer can purchase multiple times on the same date, and the dataset contains a mix of both random and seasonal transactions. A dummy sample of my data is below:
Date CustomerID TransactionType CompanyAccountNum Amount
01.01.19 1 Sales 111xxx 100
01.01.19 1 Credit 111xxx -3100
01.01.19 4 Sales 111xxx 100
02.01.19 3 Sales 311xxx 100
02.01.19 1 Refund 211xxx -2100
03.01.19 4 Sales 211xxx 3100
Which algorithm/approach would suit this problem best? I have tried a multivariate FBProphet model (in Python) so far and received less-than-satisfactory results.
You may try the pyod package, with methods like Isolation Forest or HBOS.
It's advertised as 'a comprehensive and scalable Python toolkit for detecting outlying objects in multivariate data' but your mileage may vary in terms of performance, so first check out their benchmarks.
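A minimal sketch with pyod's Isolation Forest, assuming df holds the transaction table from the question and that the features have already been reduced to numeric columns; using only Amount here is illustrative, not a recommendation.

from pyod.models.iforest import IForest

# Numeric feature matrix; in practice you would add engineered features
# (per-customer aggregates, transaction type encodings, etc.).
X = df[["Amount"]].values

clf = IForest(contamination=0.01)  # the expected fraction of anomalies is a guess
clf.fit(X)

df["anomaly"] = clf.labels_                 # 1 = flagged as outlier, 0 = inlier
df["anomaly_score"] = clf.decision_scores_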
If you have time series data, it is better to first apply methods like a moving average or exponential smoothing to your data to remove trends and seasonality. Otherwise, all data points that fall inside seasonal or trend periods will be labelled as anomalies.
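A minimal sketch of that kind of detrending with pandas, assuming df is the transaction table aggregated to a daily total; the 7-day window is an arbitrary choice.

# Aggregate to a daily total, then remove a rolling-mean trend; the residual
# is what you would feed into the anomaly detector instead of the raw values.
daily = df.groupby("Date")["Amount"].sum()
trend = daily.rolling(window=7, center=True).mean()  # 7-day window is arbitrary
residual = daily - trend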
The dataset records occurrences of a particular insect at a given location for a given year and month, and it is available for about 30 years. Now, given an arbitrary location and a future year and month, I want the probability of finding that insect in that place based on the historic data.
I tried to frame it as a classification problem by labelling all available data as 1 and then checking the probability of a new data point being label 1, but an error was thrown because there must be at least two classes to train.
The data looks like this (x and y are longitude and latitude):
x      y      year  month
17.01  22.87  2013  01
42.32  33.09  2015  12
Think about the problem as a map. You'll need a map for each time period you're interested in, so sum all the occurrences in each month and year for each location. Unless the locations are already binned, you'll need to bin them, as otherwise a single exact point is pretty meaningless. So round the values in x and y to a reasonable precision level, or use numpy to bin the data. Then you can create a map of the counts, or use a Markov model to predict the occurrence.
The reason you're not getting anywhere at the moment is that the chance of finding an insect at any random point is virtually 0.
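A minimal sketch of the binning-and-counting step with pandas, assuming df holds the x, y, year, month columns from the question; the 1-degree bin size is an arbitrary choice.

import pandas as pd

# Bin coordinates to a 1-degree grid (bin size is an arbitrary choice).
df["x_bin"] = df["x"].round(0)
df["y_bin"] = df["y"].round(0)

# Occurrence counts per grid cell and calendar month, across all years.
counts = df.groupby(["x_bin", "y_bin", "month"]).size()

# Rough empirical frequency: in what fraction of the ~30 years was the insect
# observed in this cell during this calendar month?
years_observed = df.groupby(["x_bin", "y_bin", "month"])["year"].nunique()
empirical_prob = years_observed / df["year"].nunique()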
I have a DataFrame which contains a column that is the rolling standard deviation from day to day. (i.e. df['daily_std'] = df['value'].rolling(2).std())
However, when I retrieve the DataFrame, it no longer contains the original values, df['value'], and they cannot be retrieved.
I need to get the single, total standard deviation for the entire series of original values (i.e. df['total_std'] = df['value'].std()). Is it possible, mathematically, to somehow calculate this total standard deviation from the individual rolling daily standard deviations, since the original values are no longer available?