Finding daily patterns with machine learning - python

I have created a huge log of daily activity in the format [timestamp, location]. For example
[{1365650747255, 'san francisco'},
{1365650743354, 'san francisco'},
{1365650741349, 'san mateo'},
{1365650756324, 'mountain view'},
...
{1365650813354, 'menlo park'}]
What are the ways I can mine this information to find patterns like
"On Sunday evenings, it's probable that I am near San Francisco"
"On Monday afternoons it's probable that I am near Menlo Park"
The problem is that
The dataset is huge.
It looks impossible to judge the date/time/day by applying a function on the timestamp value (unless we decode the timestamp into date/time values).

I do not see the problem here. As the timestamp counts milliseconds since the epoch, you only have to apply the modulo operator with the value being the range of interest (e.g. one week). If you train a classifier on that, you should be able to predict every upcoming place. The main problem is not performance, as the learning is only done now and then, but how to update the learned dataset.
As already stated, you do not have to use machine learning for this; however, if you want to, this can basically be done with a k-nearest-neighbor classifier on your 1-d dataset.
[EDIT]:
Mixed up languages but fixed it: A classifier is the algorithm which will do the statistical classification.
In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known.[1]
As I have only used sklearn for such things, the following is a minimalistic example of how you could use a k-nearest-neighbor classifier [2]. To be able to classify, you have to map the location strings to numbers, then train your classifier on the given dataset; afterwards you are able to predict the location for a new timestamp.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

data = [[1365650747255, 'san francisco'],
        [1365650743354, 'san francisco'],
        [1365650741349, 'san mateo'],
        [1365650756324, 'mountain view'],
        ...
        [1365650813354, 'menlo park']]

# Map location strings to integers and replace them in place
location_mapping = {}
location_index = 0
for index, (time, location) in enumerate(data):
    if location not in location_mapping:
        location_mapping[location] = location_index
        location_index += 1
    data[index][1] = location_mapping[location]
inverse_location_mapping = {value: key for key, value in location_mapping.items()}

data = np.array(data)

# The timestamps are in milliseconds, so express a week in milliseconds too
week = 1000 * 60 * 60 * 24 * 7

# Set up the classifier
classifier = KNeighborsClassifier(n_neighbors=10)

# Train the classifier on the timestamp-within-week feature (sklearn expects a 2-d X)
classifier.fit((data[:, 0] % week).reshape(-1, 1), data[:, 1])

# Predict the desired location for a new timestamp
prediction = classifier.predict([[1365444444444 % week]])
print(inverse_location_mapping[prediction[0]])
[1] : http://en.wikipedia.org/wiki/Statistical_classification
[2] : http://scikit-learn.org/dev/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

The performance of this solution depends on how granular your requirement for pattern recognition is.
Let's assume your requirement is dividing the day into 4 parts: Morning, Noon, Evening, Night; let's call them time_slots.
Now let's take a look at how big your daily activity log is: 1 year, 2 years, 3 years? Let's assume it is 1 year.
So we have a total of 365 * 4 = 1460 time_slots to monitor.
Now, create a simple map based on timestamps for each time_slot, e.g. it begins at T1 and ends at T2 (where T1 and T2 are timestamps like 1365650813354).
Based on the timestamp value in your log, it is easy to find its time_slot, i.e. the evening of 28th January or the morning of 30th January.
You will have to store the time_slot vs. place_i_was data in any suitable database with a proper schema; the right choice depends on the kind of querying and analysis you want.
This way you will not need to run formulas over your dataset; the predefined map/database lookup will serve your purpose. A small sketch of the time_slot lookup is shown below.
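Purely to illustrate the bucketing described above (my own sketch, not part of the original answer; the millisecond timestamps and the four slot boundaries are assumptions), a time_slot lookup could look like this:

from datetime import datetime, timezone

# Assumed slot boundaries (start hour, name); adjust to taste
SLOTS = [(0, 'Night'), (6, 'Morning'), (12, 'Noon'), (18, 'Evening')]

def time_slot(timestamp_ms):
    """Map a millisecond timestamp to a (date, slot_name) key."""
    dt = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    slot = next(name for start, name in reversed(SLOTS) if dt.hour >= start)
    return dt.date(), slot

# Prints the (date, slot) bucket for one of the question's timestamps
print(time_slot(1365650813354))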

Not sure these questions require machine learning; you can use regular statistics for that, i.e. build a probability distribution plot where x is the time of day and y is the probability that the location is San Francisco, then calculate the probability of San Francisco when the time is between a and b... (a count-based sketch follows the loading example below).
This is how to load your data into a pandas DataFrame:
from __future__ import print_function, division
import pandas as pd
import datetime
df = pd.read_csv("data.csv",
                 names=["timestamp", "location"],
                 parse_dates=["timestamp"],
                 date_parser=lambda x: datetime.datetime.fromtimestamp(int(x) / 1000))
print(df.head())
Outputs:
timestamp location
0 2013-04-11 04:25:47.255000 "san francisco"
1 2013-04-11 04:25:43.354000 "san francisco"
2 2013-04-11 04:25:41.349000 "san mateo"
3 2013-04-11 04:25:56.324000 "mountain view"
4 2013-04-11 04:26:53.354000 "menlo park"
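To illustrate the count-based estimate suggested above (my own sketch, not part of the original answer; it assumes the df loaded in the snippet above), the per-location probabilities by day of week and hour can be read straight off the grouped counts:

# Add calendar features derived from the parsed timestamps
df["dayofweek"] = df["timestamp"].dt.day_name()
df["hour"] = df["timestamp"].dt.hour

# Empirical probability of each location within each (day, hour) bucket
probs = (df.groupby(["dayofweek", "hour"])["location"]
           .value_counts(normalize=True)
           .rename("probability"))

# e.g. P(location | Sunday, hour 18) for every location seen in that bucket
print(probs.loc[("Sunday", 18)])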

Convert the timestamps into tokens: "sunday morning".
Then do association rule mining to obtain rules such as
night => home
sunday morning => running in the park
where you only keep those rules in which the desired locations occur on the right-hand side.
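As an illustration only (my own sketch, not part of the original answer; it assumes each observation has already been reduced to a pair of tokens such as a time-of-week token and a location, and it uses the mlxtend library), the rule mining could look like this:

import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Hypothetical transactions: one (time token, location) pair per observation
transactions = [
    ["sunday morning", "park"],
    ["sunday morning", "park"],
    ["night", "home"],
    ["night", "home"],
    ["monday afternoon", "menlo park"],
]

# One-hot encode the transactions
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

# Mine frequent itemsets, then rules; keep only rules whose right-hand side is a location
itemsets = apriori(onehot, min_support=0.2, use_colnames=True)
rules = association_rules(itemsets, metric="confidence", min_threshold=0.8)
locations = {"park", "home", "menlo park"}
rules = rules[rules["consequents"].apply(lambda c: c.issubset(locations))]
print(rules[["antecedents", "consequents", "support", "confidence"]])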

Firstly, convert the timestamp value to year-month-weekday. Replace the timestamp column by 3 columns corresponding to year, month and weekday.
Later you can simply group by a certain range of dates and count the number of instances for each location, for example:
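A minimal sketch of that idea (my own illustration; it assumes millisecond timestamps in a DataFrame with timestamp and location columns):

import pandas as pd

df = pd.DataFrame({"timestamp": [1365650747255, 1365650743354, 1365650741349],
                   "location": ["san francisco", "san francisco", "san mateo"]})

# Replace the raw timestamp by calendar columns
ts = pd.to_datetime(df["timestamp"], unit="ms")
df["year"], df["month"], df["weekday"] = ts.dt.year, ts.dt.month, ts.dt.day_name()
df = df.drop(columns="timestamp")

# Count instances per location for each (year, month, weekday) group
counts = df.groupby(["year", "month", "weekday"])["location"].value_counts()
print(counts)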

Related

Is there a way to efficiently apply a function to 3 million values in a Pandas column?

I am currently working on a course in Data Science on how to win data science competitions. The final project is a Kaggle competition that we have to participate in.
My training dataset has close to 3 million rows, and one of the columns is a "date of purchase" column.
I want to calculate the distance of each date to the nearest public holiday.
E.g. if the date is 31/12/2014, the nearest PH would be 01/01/2015. The number of days apart would be "1".
I cannot think of an efficient way to do this operation. I have a list of Timestamps, each one a public holiday in Russia (the dataset is from Russia).
def dateDifference(target_date_raw):
    abs_deltas_from_target_date = np.subtract(russian_public_holidays, target_date_raw)
    abs_deltas_from_target_date = [i.days for i in abs_deltas_from_target_date if i.days >= 0]
    index_of_min_delta_from_target_date = np.min(abs_deltas_from_target_date)
    return index_of_min_delta_from_target_date
where 'russian_public_holidays' is the list of public holiday dates and 'target_date_raw' is the date for which I want to calculate distance to the nearest public holiday.
This is the code I use to create a new column in my DataFrame for the difference of dates.
training_data['closest_public_holiday'] = [dateDifference(i) for i in training_data['date']]
This code ran for nearly 25 minutes and showed no signs of completing, which is why I turn to you guys for help.
I understand that this is probably the least Pandorable way of doing things, but I couldn't really find a clean way of operating on a single column during my research. I saw a lot of people say that using the "apply" function on a single column is a bad way of doing things. I am very new to working with such large datasets, which is why clean and efficient practices seem to elude me for now. Please do let me know what would be the best way to tackle this!
Try this and see if it helps with the timing. I worry that it may take up too much memory; I don't have the data to test, but you can try.
import numpy as np
import pandas as pd

# Example frame with month-end dates; assuming holidays 1/1/2021, 8/9/2021, 12/25/2021
df = pd.DataFrame(pd.date_range('01/01/2021', '12/31/2021', freq='M'), columns=['Date'])
holidays = pd.to_datetime(np.array(['1/1/2021', '12/25/2021', '8/9/2021'])).to_numpy()

# Broadcast every date against every holiday, then take the smallest absolute gap in days
df['Days Away'] = (
    np.min(np.absolute(df.Date.to_numpy().reshape(-1, 1) - holidays), axis=1)
    / np.timedelta64(1, 'D')
)

Time Series Forecasting in python

I have a dataset that contains 300 rows and 4 columns: Date, Hour, Counts (how many ads were aired on TV during this hour), Visits (how many visits were made during this hour). Here is an example of the data:
If I want to test the effect of the TV spots on visits to the website, should I treat it as a time series and use regression, for example? And what should the input table look like in that case? I know that I have to split the date into day and month, but how should I treat the counts column? Leave it as it is, if my y is to be the number of visits?
Thanks
Just to avoid the case of a single-input, single-output regression model, you could use hour and counts as inputs and predict the visits.
I don't know what format the hours are in; if they are in 12-hour format, convert them to 24-hour format before feeding them to your model.
If you want to predict the next dates and hours in the time series, regression models or classical time series models such as ARIMA, ARMA, or exponential smoothing would be useful.
But as you need to predict the effectiveness of the TV spots, I recommend generating features with the tsfresh library in Python, based on counts, to remove the time effect, and then using a machine learning model such as SVR or Gradient Boosting to do the prediction.
In your problem:
from tsfresh import extract_features

extracted_features = extract_features(df,
                                      column_id="Hour",
                                      column_kind=None,
                                      column_value="Counts",
                                      column_sort="time")
So, your target table will be:
Hour  Feature_1    Feature_2    ...  Visits(Avg)
0     min(Counts)  max(Counts)  ...  mean(Visits)
1     min(Counts)  max(Counts)  ...  mean(Visits)
2     min(Counts)  max(Counts)  ...  mean(Visits)
min() and max() are just example features; tsfresh can extract many other features. Visit here for more information. (A sketch of the modelling step on top of these features follows.)
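To connect the extracted features to a model as recommended above (my own sketch, not the answerer's code; it assumes the extracted_features table from the snippet and that df has Hour and Visits columns):

from sklearn.ensemble import GradientBoostingRegressor

# Target: average visits per hour, matching the Visits(Avg) column sketched above
target = df.groupby("Hour")["Visits"].mean()

# Align the feature rows with the target's Hour index; fill any missing features
X = extracted_features.reindex(target.index).fillna(0)

model = GradientBoostingRegressor()
model.fit(X, target)
print(model.predict(X.head()))  # in-sample sanity check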

Frequency in pandas timeseries index and statsmodel

I have a pandas time series y that does not work well with statsmodels functions.
import statsmodels.api as sm
y.tail(10)
2019-09-20 7.854
2019-10-01 44.559
2019-10-10 46.910
2019-10-20 49.053
2019-11-01 24.881
2019-11-10 52.882
2019-11-20 84.779
2019-12-01 56.215
2019-12-10 23.347
2019-12-20 31.051
Name: mean_rainfall, dtype: float64
I verify that it is indeed a timeseries
type(y)
pandas.core.series.Series
type(y.index)
pandas.core.indexes.datetimes.DatetimeIndex
From here, I am able to pass the timeseries through an autocorrelation function with no problem, which produces the expected output
plot_acf(y, lags=72, alpha=0.05)
However, when I try to pass this exact same object y to SARIMA
mod = sm.tsa.statespace.SARIMAX(y.mean_rainfall, order=pdq, seasonal_order=seasonal_pdq)
results = mod.fit()
I get the following error:
A date index has been provided, but it has no associated frequency information and so will be ignored when e.g. forecasting.
The problem is that the frequency of my time series is not regular (it is the 1st, 10th, and 20th of every month), so I cannot set freq='m' or freq='D', for example. What is the workaround in this case?
I am new to using time series; any advice on how to keep my index from being ignored during forecasting would help, since this prevents any predictions from being made.
First of all, it is extremely important to understand what the relationship between the datetime column and the target column (rainfall) is. Looking at the snippet you provide, I can think of two possibilities:
y represents the rainfall that occurred in the date range between the current row's date and the next row's date. In that case, the time series is a kind of aggregated rainfall series with unequal date buckets, i.e. 1-10, 10-20, 20-(end-of-month). If that is the case, you have two options:
You can disaggregate your data using either equal weighting or, even better, an interpolation, to create a continuous and relatively smooth time series. You can then fit your model on the daily time series and generate predictions, which will also naturally be daily in nature. These you can aggregate back to the 1-10, 10-20, 20-(end-of-month) buckets to get your predictions. One way to do the resampling is the code below.
# Interpolate the bucketed rainfall onto a daily grid
ts.Date = pd.to_datetime(ts.Date, format='%d/%m/%y')
ts['delta_time'] = (ts['Date'].shift(-1) - ts['Date']).dt.days
ts['delta_rain'] = ts['Rain'].shift(-1) - ts['Rain']
ts['timesteps'] = ts['Date']
ts['grad_rain'] = ts['delta_rain'] / ts['delta_time']
ts.set_index('timesteps', inplace=True)
ts = ts.resample('d').ffill()
# Linearly interpolate the Rain value within each bucket, then scale to a per-day amount
ts['daily_rain'] = ts['Rain'] + ts['grad_rain'] * (ts.index - ts['Date']).dt.days
ts['daily_rain'] = ts['daily_rain'] / ts['delta_time']
print(ts.head(50))
daily_rain is now the target column, and the index (timesteps) is the timestamp.
The other option is to approximate that the date ranges 1-10, 10-20, 20-(EOM) are roughly 10 days each, so these are effectively equal timesteps. Of course statsmodels won't allow that as-is, so you would need to reset the index to mock datetimes for which you maintain a mapping. Below is what you would use in statsmodels as y, while keeping a mapping back to your original dates (a small sketch of building such a mock index follows the example). The frequency will be 'd' (daily) and you would need to rescale the seasonality as well so that it follows the new date scale.
y.tail(10)
2019-09-01 7.854
2019-09-02 44.559
2019-09-03 46.910
2019-09-04 49.053
2019-09-05 24.881
2019-09-06 52.882
2019-09-07 84.779
2019-09-08 56.215
2019-09-09 23.347
2019-09-10 31.051
Name: mean_rainfall, dtype: float64
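For illustration only (my own sketch, not the answerer's code; the start date of the mock index is arbitrary), one way to build such a mock daily index while keeping a mapping back to the original dates:

import pandas as pd

# Mock daily index of the same length as y; remember which real date each mock date stands for
mock_index = pd.date_range(start='2019-09-01', periods=len(y), freq='D')
index_mapping = dict(zip(mock_index, y.index))

y_mock = pd.Series(y.values, index=mock_index, name=y.name)
# Fit SARIMAX on y_mock (freq 'D'); translate forecast dates back via index_mapping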
I would recommend the first option though, as it's just more accurate in nature. You can also try out other aggregation levels during model training as well as for your predictions. More control!
The second scenario is that the data represents measurements only for the date itself and not for the range. That would mean that technically you do not have enough info to construct an accurate time series: your timesteps are not equidistant and you don't know what happened between them. However, you can still improvise and get some approximations going. The second approach listed above would still work as-is. For the first approach you'd need to do interpolation, but given that the target variable is rainfall, which has a lot of variation, I would highly discourage this!
As far as I can see, the package uses the frequency as a premise for everything, since it's a time-series problem.
So you will not be able to use it with data at an irregular frequency. In fact, you will have to make an assumption in your analysis to adapt your data for this use. Some options are:
1) Consider 3 different analyses (1st days, 10th days, 20th days individually) and use 30d frequency.
2) As your data points are roughly 10 days apart, you can consider using some kind of interpolation and then resampling to a frequency of 1d. Of course, this option only makes sense depending on the nature of your problem and how quickly your data change.
Either way, I would just like to point out that how you model your problem and your data is a key thing when dealing with time series and data science in general. In my experience as a data scientist, it is by analyzing the domain (where your data came from) that you get a feeling for which approach will work better.

Python: Shift time series so they all match at a given y value

I'm writing my own code to analyse/visualise COVID-19 data from the European CDC.
https://opendata.ecdc.europa.eu/covid19/casedistribution/csv
I've got a simple code to extract the data and make plots with cumulative deaths against time, and am trying to add functionality.
My aim is something like the attached graph, with all countries time-shifted to match at a common point (in this case the 5th death). I want to make a general bit of code to shift countries to match at the nth death.
https://ourworldindata.org/grapher/covid-confirmed-deaths-since-5th-death
The current way I'm trying to do this is to have a maze of "if group is 'country' shift by ..." terms.
Where ... is a lookup to find the date for the particular 'country' when there were 'n' deaths, and to interpolate fractional dates where appropriate.
i.e. currently deaths are assigned at 00:00 day/month, but the data can be shifted by 2/3 of a day, as below.
datetime cumulative deaths
00:00 15/02 80
00:00 16/02 110
my '...' should give 16:00 15/02
I'm working on this right now but it doesn't feel very efficient and I'm sure there must be a much simpler way that I'm not seeing.
Essentially, despite copious googling, I can't seem to find a simple way of automatically shifting a bunch of time series to match at a particular y value, which feels like it should have some built-in functionality, i.e. a lookup with interpolation.
import pandas as pd

#### Live url (I've downloaded my own csv and been calling that for code development)
url = 'https://opendata.ecdc.europa.eu/covid19/casedistribution/csv'
dataraw = pd.read_csv(url)

#### Extract the relevant columns
data = dataraw.loc[:, ["dateRep", "countriesAndTerritories", "deaths"]]

#### Convert date format
data['dateRep'] = pd.to_datetime(data['dateRep'], dayfirst=True)

#### Sort by date
data = data.sort_values(["dateRep"], ascending=True)

#### Cumulative deaths per country
data['cumdeaths'] = data.groupby('countriesAndTerritories')['deaths'].cumsum()

#### Limit to countries with cumulative deaths > 500
data = data.groupby('countriesAndTerritories').filter(lambda x: x['cumdeaths'].max() > 500)

#### Remove China from the data for now as it doesn't match so well with dates
data = data[data['countriesAndTerritories'] != "China"]

#### Only recent dates
data = data[data['dateRep'] > '2020-03-01']
print(data)
You can use groupby('countriesAndTerritories') and the transform function to add a column that gives every row the date on which its country hit the nth death.
Then you can do a vectorized subtraction of the date column and the new column to get the number of days, for example:
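A minimal sketch of that idea (my own illustration, not the answerer's code; it assumes the data frame and column names from the question and n = 5):

n = 5  # match countries at the nth death

# Per-country date of the nth death: blank out dates before the threshold,
# then broadcast the group minimum back onto every row with transform
data['ref_date'] = (
    data['dateRep'].where(data['cumdeaths'] >= n)
        .groupby(data['countriesAndTerritories'])
        .transform('min')
)

# Vectorized subtraction: days elapsed since each country's nth death
data['days_since_nth_death'] = (data['dateRep'] - data['ref_date']).dt.days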

Probability of occurrence based on historical data

The dataset records occurrences of a particular insect at a location for a given year and month. This is available for about 30 years. Now, when I give a random location and a future year and month, I want to know the probability of finding that insect in that place based on the historic data.
I tried to treat it as a classification problem by labelling all available data as 1 and wanted to check the probability of a new data point being labelled 1, but an error was thrown because there should be at least two classes to train.
The data looks like this (x and y are longitude and latitude):
x      y      year  month
17.01  22.87  2013  01
42.32  33.09  2015  12
Think about the problem as a map. You'll need a map for each time period you're interested in, so sum all the occurrences in each month and year for each location. Unless the locations are already binned, you'll need to use some binning, as otherwise it is pretty meaningless. So round the values in x and y to a reasonable precision level, or use numpy to bin the data. Then you can create a map with the counts, or use a Markov model to predict the occurrence. A sketch of the binning-and-counting step is shown below.
The reason you're not getting anywhere at the moment is that the chance of finding an insect at any random point is virtually 0.
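Purely as an illustration of the binning-and-counting step (my own sketch; the column names follow the question and the 1-degree grid cells are an assumption):

import numpy as np
import pandas as pd

# Hypothetical occurrence records in the question's format
obs = pd.DataFrame({"x": [17.01, 17.43, 42.32], "y": [22.87, 22.15, 33.09],
                    "year": [2013, 2014, 2015], "month": [1, 1, 12]})

# Bin the coordinates into 1-degree grid cells
obs["x_bin"] = np.floor(obs["x"]).astype(int)
obs["y_bin"] = np.floor(obs["y"]).astype(int)

# Occurrence counts per grid cell and month, across all years
counts = obs.groupby(["x_bin", "y_bin", "month"]).size()

# Rough probability: fraction of observed years with at least one sighting in that cell/month
years_with_sighting = obs.groupby(["x_bin", "y_bin", "month"])["year"].nunique()
n_years = obs["year"].nunique()  # ~30 in the real dataset
prob = (years_with_sighting / n_years).rename("p_occurrence")
print(prob)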
