Probability of occurrence based on historical data - python

The dataset is of occurrence of particular insects in a location for the given year and month. This is available for about 30 years. Now when I give a random location and year, month of future, I want what is the probability of finding that insects in that place based on the historic data.
I tried to to classification problem by labelling all available data as 1. And wanted to check the probability of new data point being label 1 . But the error was thrown as there should be at least two classes to train.
The data looks like this:The x and y are longitude and latitude
x y year month
17.01 22.87 2013 01
42.32. 33.09 2015 12

Think about the problem as a map. You'll need a map for each time period you're interested in, so sum all the occurrences in each month and year for each location. Unless the locations are already binned, you'll need to use some binning as otherwise it is pretty meaningless. So round the values in x and y to a reasonable precision level or use numpy to bin the data. Then you can create a map with the counts/ use a markov model to predict the occurrence.
The reason you're not getting anywhere at the moment is that the chance of finding an insect at any random point is virtually 0.

Related

How to calculate relative frequency of an event from a dataframe?

I have a dataframe with temperature data for a certain period. With this data, I want to calculate the relative frequency of the month of August being warmer than 20° as well as January being colder than 2°. I have already managed to extract these two columns in a separate dataframe, to get the count of each temperature event and used the normalize function to get the frequency for each value in percent (see code).
df_temp1[df_temp1.aug >=20]
df_temp1[df_temp1.jan <= 2]
df_temp1['aug'].value_counts()
df_temp1['jan'].value_counts()
df_temp1['aug'].value_counts(normalize=True)*100
df_temp1['jan'].value_counts(normalize=True)*100
What I haven't managed is to calculate the relative frequency for aug>=20, jan<=2, as well as aug>=20 AND jan<=2 and aug>=20 OR jan<=2.
Maybe someone could help me with this problem. Thanks.
I would try something like this:
proprortion_of_augusts_above_20 = (df_temp1['aug'] >= 20).mean()
proprortion_of_januaries_below_20 = (df_temp1['jan'] <= 2).mean()
This calculates it in two steps. First, df_temp1['aug'] >= 20 creates a boolean array, with True representing months above 20, and False representing months which are not.
Then, mean() reinterprets True and False as 1 and 0. The average of this is the percentage of months which fulfill the criteria, divided by 100.
As an aside, I would recommend posting your data in a question, which allows people answering to check whether their solution works.

Find specific combination of values in pandas dataframe

I am preparing a dataframe for machine learning. The data set contains weather data from several weather stations in australia over a period of 10 years. One of the measured attributes is Evaporation. It has about 50% missing values.
Now I want to find out, whether the missing values are evenly distributed over all weather stations or if roughly half of the weather stations just never measured Evaporation.
How can I find out about the distribution of a value in combination with another attribute?
I basically want to loop over the weather stations and get a count of NaNs and normal values.
rain_df.query('Location == "Albury"').Location.count()
This gives me the number of measurement points from the weaher station in Albury. Now how can I find out how many NaNs were measured in Albury compared to normal (non-NaN) measurements?
You can use .isnull() to mask a series with True for NaNs and False for everything else. Then you can use .value_counts(normalize=True) to get the proportions of NaN and non NaN in that series.
rain_df.query('Location == "Albury"').Location.isnull().value_counts(normalize=True)

Frequency in pandas timeseries index and statsmodel

I have a pandas timeseries y that does not work well with statsmodel functions.
import statsmodels.api as sm
y.tail(10)
2019-09-20 7.854
2019-10-01 44.559
2019-10-10 46.910
2019-10-20 49.053
2019-11-01 24.881
2019-11-10 52.882
2019-11-20 84.779
2019-12-01 56.215
2019-12-10 23.347
2019-12-20 31.051
Name: mean_rainfall, dtype: float64
I verify that it is indeed a timeseries
type(y)
pandas.core.series.Series
type(y.index)
pandas.core.indexes.datetimes.DatetimeIndex
From here, I am able to pass the timeseries through an autocorrelation function with no problem, which produces the expected output
plot_acf(y, lags=72, alpha=0.05)
However, when I try to pass this exact same object y to SARIMA
mod = sm.tsa.statespace.SARIMAX(y.mean_rainfall, order=pdq, seasonal_order=seasonal_pdq)
results = mod.fit()
I get the following error:
A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
The problem is that the frequency of my timeseries is not regular (it is the 1st, 10th, and 20th of every month), so I cannot set freq='m'or freq='D' for example. What is the workaround in this case?
I am new to using timeseries, any advice on how to not have my index ignored during forecasting would help. This prevents any predictions from being possible
First of all, it is extremely important to understand what the relationship between the datetime column and the target column (rainfall) is. Looking at the snippet you provide, I can think of two possibilities:
y represents the rainfall that occurred in the date-range between the current row's date and the next row's date. If that is the case, the timeseries is kind of an aggregated rainfall series with unequal buckets of date i.e. 1-10, 10-20, 20-(end-of-month). If that is the case, you have two options:
You can disaggregate your data using either an equal weightage or even better an interpolation to create a continuous and relatively smooth timeseries. You can then fit your model on the daily time-series and generate predictions which will also naturally be daily in nature. These you can aggregate back to the 1-10, 10-20, 20-(end-of-month) buckets to get your predicitons. One way to do the resampling is using the code below.
ts.Date = pd.to_datetime(ts.Date, format='%d/%m/%y')
ts['delta_time'] = (ts['Date'].shift(-1) - ts['Date']).dt.days
ts['delta_rain'] = ts['Rain'].shift(-1) - ts['Rain']
ts['timesteps'] = ts['Date']
ts['grad_rain'] = ts['delta_rain'] / ts['delta_time']
ts.set_index('timesteps', inplace=True )
ts = ts.resample('d').ffill()
ts
ts['daily_rain'] = ts['Rain'] + ts['grad_rain']*(ts.index - ts['Date']).dt.days
ts['daily_rain'] = ts['daily_rain']/ts['delta_time']
print(ts.head(50))
daily_rain is now the target column and the index i.e. timesteps is the timestamp.
The other option is that you approximate that the date-range of 1-10, 10-20, 20-(EOM) is roughly 10 days, so these are indeed equal timesteps. Of course statsmodel won't allow that so you would need to reset the index to mock datetime for which you maintain a mapping. Below is what you use in the statsmodel as y but do maintain a mapping back to your original dates. Freq will 'd' or 'daily' and you would need to rescale seasonality as well such that it follows the new date scale.
y.tail(10)
2019-09-01 7.854
2019-09-02 44.559
2019-09-03 46.910
2019-09-04 49.053
2019-09-05 24.881
2019-09-06 52.882
2019-09-07 84.779
2019-09-08 56.215
2019-09-09 23.347
2019-09-10 31.051
Name: mean_rainfall, dtype: float64
I would recommend the first option though as it's just more accurate in nature. Also you can try out other aggregation levels also during model training as well as for your predictions. More control!
The second scenario is that the data represents measurements only for the date itself and not for the range. That would mean that technically you do not have enough info now to construct an accurate timeseries - your timesteps are not equidistant and you don't have enough info for what happened between the timesteps. However, you can still improvise and get some approximations going. The second approach listed above would still work as is. For the first approach, you'd need to do interpolation but given the target variable which is rainfall and rainfall has a lot of variation, I would highly discourage this!!
As I can see, the package uses the frequency as a premise for everything, since it's a time-series problem.
So you will not be able to use it with data of different frequencies. In fact, you will have to make an assumption for your analysis to adequate your data for the use. Some options are:
1) Consider 3 different analyses (1st days, 10th days, 20th days individually) and use 30d frequency.
2) As you have ~10d equally separated data, you can consider using some kind of interpolation and then make downsampling to a frequency of 1d. Of course, this option only makes sense depending on the nature of your problem and how quickly your data change.
Either way, I just would like to point out that how you model your problem and your data is a key thing when dealing with time series and data science in general. In my experience as a data scientist, I can say that is analyzing at the domain (where your data came from) that you can have a feeling of which approach will work better.

How to calculate the variance of stocks of the last 36 months (some monthly data are missing)?

I downloaded some stock data from CRSP and need the variance of the stock returns of the last 36 months of that company.
So, basically the variance based on two conditions:
Same PERMCO (company number)
Monthly stock returns of the last 3 years.
However, I excluded penny stocks from my sample (stocks with prices < $2). Hence, sometimes months are missing and e.g. april and junes monthly returns are directly on top of each other.
If I am not mistaken, a rolling function (grouped by Permco) would just take the 36 monthly returns above. But when months are missing, the rolling function would actually take more than 3 years data (since the last 36 monthly returns would exceed that timeframe).
Usually I work with Ms Excel. However, in this case the amount of data is too big and it takes years to let Excel calculate stuff. Thats why I want to tackle that problem with Python.
The sample is organized as follows:
PERMNO date SHRCD PERMCO PRC RET
When I have figured out how to make a proper table in here I will show you a sample of my data.
What I have tried so far:
data["RET"]=data["RET"].replace(["C","B"], np.nan)
data["date"] = pd.to_datetime(date["date"])
data=data.sort_values[("PERMCO" , "date"]).reset_index()
L3Yvariance=data.groupby("PERMCO")["RET"].rolling(36).var().reset_index()
Sometimes there are C and B instead of actual returns, thats why the first line
You can replace the missing values by the mean value. It won't affect the variance as the variance is calculated after subtracting the mean, so in this case, for times you won't have the value, the contribution to variance will be 0.

Removing points which deviate too much from adjacent point in Pandas

So I'm doing some time series analysis in Pandas and have a peculiar pattern of outliers which I'd like to remove. The bellow plot is based on a dataframe with the first column as a date and the second column the data
AS you can see those points of similar values interspersed and look like lines are likely instrument quirks and should be removed. Ive tried using both rolling_mean, median and removal based on standard deviation to no avail. For an idea of density, its daily measurements from 1984 to the present. Any ideas?
auge = pd.read_csv('GaugeData.csv', parse_dates=[0], header=None)
gauge.columns = ['Date', 'Gauge']
gauge = gauge.set_index(['Date'])
gauge['1990':'1995'].plot(style='*')
And the result of applying rolling median
gauge = pd.rolling_mean(gauge, 5, center=True)#gauge.diff()
gauge['1990':'1995'].plot(style='*')
After rolling median
You can demand that each data point has at least "N" "nearby" data points within a certain distance "D".
N can be 2 or more.
nearby for element gauge[i] can be a pair like: gauge[i-1] and gauge[i+1], but since some only have neighbors on one side you can ask for at least two elements with distance in indexes (dates) less than 2. So, let's say at least 2 of {gauge[i-2], gauge[i-1] gauge[i+1], gauge[i+2]} should satisfy: Distance(gauge[i], gauge[ix]) < D
D - you can decide this based on how close you expect those real data points to be.
It won't be perfect, but it should get most of the noise out of the dataset.

Categories