The project
I'm working on a small project where users can create events (such as Eat, Sleep, Watch a movie, etc.) and record log entries matching these events.
My data model looks like this (the application itself is in Python 3 / Django, but I don't think it matters here):
# a sample event
event = {
'id': 1,
'name': 'Eat',
}
# a sample entry
entry = {
'event_id': event['id'],
'user_id': 12,
# record date of the entry
'date': '2017-03-16T12:56:32.095465+00:00',
# user-given tags for the entry
'tags': ['home', 'delivery'],
# A user notation for the entry, it can be positive or negative
'score': 2,
'comment': 'That was a tasty meal',
}
Users can record an arbitrary number of entries for an arbitrary number of events, they can create new events when they need to. The data is stored in a relational database.
Now, I'd like to make data entry easier for users by suggesting relevant events when they visit the "Add an entry" form. At the moment they can select the event corresponding to their entry from a dropdown, but I'd also like to suggest a few likely events on top of that.
I was thinking that, given a user's history (all the recorded entries), it should be possible to predict likely inputs by identifying patterns in the entries, e.g.:
Eat usually happens every day, around noon and 7:00PM
Sleep usually occurs after 10:00 PM
Watch a movie usually occurs on Friday nights after 8:00 PM
Ideally, I'd like a function that, given a user ID and a datetime, uses the user's history to return a list of the events that are most likely to occur:
def get_events(user_id, datetime, max=3):
    # implementation
    # returns a list of up to max events
    return events
So if I take the previous example (with more human dates), I would get the following results:
>>> get_events(user_id, 'Friday at 9:00 PM')
['Watch a movie', 'Sleep', 'Eat']
>>> get_events(user_id, 'Friday at 9:00 PM', max=2)
['Watch a movie', 'Sleep']
>>> get_events(user_id, 'Monday at 9:00 PM')
['Sleep', 'Eat', 'Watch a movie']
>>> get_events(user_id, 'Monday at noon')
['Eat']
Of course, in real life I'll pass real datetimes, and I'd like to get event IDs back so I can fetch the corresponding data from the database.
My question
(Sorry if it took some time to explain the whole thing.)
My actual question is: what algorithms / tools / libraries are required to achieve this? Is it even possible to do?
My current guess is that I'll need some fancy machine learning, using something like scikit-learn and classifiers, feeding it the user history to train it, then asking the whole thing to do its magic.
I'm not familiar at all with machine learning and I fear I don't have enough mathematical/scientific background to get started on my own. Can you provide some reference material to help me understand how to solve this, the algorithms / vocabulary I should dig into, or some pseudocode?
I think a k-nearest neighbours (kNN) approach would be a good starting point. The idea in this specific case is to look for the k events that happen closest to the given time-of-day and count the ones that occur the most often.
Example
Say you have as input Friday at 9:00 PM. Take the distance of all
events in the database to this date and rank them in ascending order.
For example if we take the distance in minutes for all elements in the
database, an example ranking could be as follows.
('Eat', 34)
('Sleep', 54)
('Eat', 76)
...
('Watch a movie', 93)
Next you take the first k = 3 of those and compute how often they
occur,
('Eat', 2)
('Sleep', 1)
so that the function returns ['Eat', 'Sleep'] (in that order).
Choosing a good k is important. If k is too small, accidental outliers (doing something once at a specific moment) will have a large influence on the result; if k is too large, unrelated events will be included in the count. One way to alleviate this is to use distance-weighted kNN (see below).
Choosing a distance function
As mentioned in the comments, using a simple distance between two timestamps might lose you some information such as day-of-week. We can solve this by making the distance function d(e1, e2) slightly more complex. In this case, we can choose it to be a tradeoff between time-of-day and day-of-week, e.g.
d(e1, e2) = a * |timeOfDay(e1) - timeOfDay(e2)| * (1/1440) +
b * |dayOfWeek(e1) - dayOfWeek(e2)| * (1/7)
where we normalised both differences by the maximum possible difference between times in a day (1440 minutes) and between days of the week (7). a and b are parameters that can be used to give more weight to one of these differences. For example, if we choose a = 1 and b = 3, we say that occurring on the same day of the week is three times more important than occurring at the same time of day.
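In Python, a rough sketch of this distance could look as follows, assuming each entry's date has already been parsed into a datetime object (the wrap-around at midnight and at the end of the week is ignored, just like in the formula above):
def d(e1, e2, a=1.0, b=1.0):
    """Weighted mix of time-of-day and day-of-week differences."""
    def minutes_of_day(dt):
        return dt.hour * 60 + dt.minute
    time_part = abs(minutes_of_day(e1) - minutes_of_day(e2)) / 1440   # normalise by minutes per day
    day_part = abs(e1.weekday() - e2.weekday()) / 7                   # normalise by days per week
    return a * time_part + b * day_part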
Distance weighted kNN
You can increase complexity (and hopefully performance) by not simply selecting the k closest elements but by assigning a weight (such as the distance) to all events according to their distance of the given point. Let e be the input example and o be an example from the database. Then we compute o's weight with respect to e as
w_o = 1 / d(e, o)^2
We see that a point's weight decreases quadratically as its distance to e increases. In your case, a number of events then has to be selected from the final ranking. This can be done by summing the weights of identical events to compute a final ranking of event types.
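As a sketch of that weighted vote, assuming the entries in the database are dicts like the sample entry in the question (with their date parsed into a datetime) and d is a distance on datetimes such as the one above:
from collections import defaultdict

def weighted_vote(when, db, d, max_events=3):
    """Sum the weight 1 / d^2 per event id and return the ids ordered by total weight."""
    totals = defaultdict(float)
    for o in db:
        dist = d(when, o['date']) or 1e-9    # guard against a distance of exactly zero
        totals[o['event_id']] += 1.0 / dist ** 2
    return sorted(totals, key=totals.get, reverse=True)[:max_events]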
Implementation
The nice thing about kNN is that it is very easy to implement. You'll roughly need the following components.
An implementation of the distance function d(e1, e2).
A function that ranks all elements in the database according to this function and a given input example.
def rank(e, db, d):
    """Rank the examples in db with respect to e using
    distance function d.
    """
    return sorted([(o, d(e, o)) for o in db],
                  key=lambda x: x[1])
A function that selects some elements from this ranking.
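For that last selection step, a minimal sketch could look as follows, assuming rank() was called on entries shaped like the sample entry in the question (so each o carries an event_id) with a distance function that accepts two such entries:
from collections import Counter

def select_events(ranking, k=10, max_events=3):
    """Plain kNN vote: look at the k nearest entries of a ranking
    produced by rank() and return the most frequent event ids."""
    counts = Counter(o['event_id'] for o, _ in ranking[:k])
    return [event_id for event_id, _ in counts.most_common(max_events)]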
Related
I am interested in learning whether any code or package has been published that can help me with the following problem:
An event takes place 30 times.
Each event can return 6 different values (0,1,2,3,4,5), each with their own unique probability.
I would like to estimate the probability that the total value, after all 30 events have taken place, is above X (e.g. 24).
The issue I have is that I can't, for a given event whose value is 3, simply multiply the probability of value 3 by 3 and add it to the previously obtained values. Instead I need to simulate every single combination of outcomes that is possible.
Is there any relatively simple way to solve this issue?
First of all, what you're describing isn't scenario analysis. That said, Python can be used to estimate complex probabilities where an analytical solution might be hard or impossible to find.
Assuming an event takes place 30 times, with outcomes [0, 1, 2, 3, 4, 5], and each outcome has a probability of occurring given by the list (for example) p = [.1, .2, .2, .3, .1, .1], you can approximate the probability that the sum of all 30 events is greater than X with
import numpy as np
X = 80
p = [.1, .2, .2, .3, .1, .1]
np.mean([sum(np.random.choice(a=[0, 1, 2, 3, 4, 5], size=30, p=p)) > X
         for i in range(10000)])
I would like to produce random data of baby sleep times, but I want the random data to behave similarly (not necessarily identically) to the following graph:
(This is just imaginary data, please don't conclude anything from it, especially not about when your baby should sleep...)
The output that I want to produce is something like:
Baby name    Sleep start         Sleep end
Noah         2016/03/21 08:38    2016/03/21 09:28
Liam         2016/03/21 12:43    2016/03/21 15:00
Emma         2016/03/21 19:45    2016/03/22 06:03
So I thought I would create a weights table of time of day and weight (for the chance that a baby will sleep).
The question is how would I generate from this weights table a random data of a range of time to sleep?
(Think about it: if a baby starts to sleep at around 8am, most likely he/she will wake within the next two hours rather than continue sleeping, and almost certainly won't sleep until 7am.)
Is there another way you would build this (without the weights table)?
I prefer to build this in Python(3), but I would appreciate the general algorithm or lead to the solution.
Given the weights table data, you could use numpy.random.choice:
np.random.choice(list_of_times,
                 num_babies,
                 p=list_of_weights_for_those_times)
Without using a weights table, you would need to find the function that describes your distribution. Then see the answer to this question.
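For example, with a completely made-up weights table over the minutes of the day and a made-up duration model (neither is derived from your graph), the sampling could look like this sketch:
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights: one value per minute of the day, larger where a
# baby is more likely to fall asleep (night, plus an afternoon nap).
minutes = np.arange(24 * 60)
weights = np.where((minutes >= 19 * 60 + 30) | (minutes < 6 * 60), 5.0,
                   np.where((minutes >= 13 * 60) & (minutes < 15 * 60), 2.0, 0.2))
p = weights / weights.sum()

# Draw a sleep-start minute per baby, then a rough duration in minutes.
starts = rng.choice(minutes, size=3, p=p)
durations = rng.normal(loc=90, scale=30, size=3).clip(20, 12 * 60)
for start, duration in zip(starts.tolist(), durations):
    print(f"starts at {start // 60:02d}:{start % 60:02d}, sleeps ~{duration:.0f} min")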
Let me start by answering the reverse of your question, since I misunderstood it at first; but it gives the answer too.
Assume that you already have a list of intervals dispersed around the 24 hours. You would like to find the number of intervals that overlap any given minute of the day, which is what you refer to as the weight.
I can think of two approaches. But first you should convert your time intervals into minutes, so the times in your list become:
# Note that the 19:45-06:03 interval has been split into two intervals, (1185, 1440) and (0, 363)
st = sorted(list(to_parts(sleep_table)))
>>> [(0, 363), (518, 568), (763, 900), (1185, 1440)]
First, a simple solution is to convert every interval into a list of 0s and 1s and sum over all the intervals:
from functools import reduce  # reduce needs an import in Python 3
eod = 60*24
weights = reduce(lambda c, x: [l + r for l, r in zip(c, [0]*x[0] + [1]*(x[1]-x[0]) + [0]*(eod-x[1]))],
                 st, [0]*eod)
This will give you a list of size 1440, where each entry is the weight for the corresponding minute of the day.
Second, a slightly more complex line-sweep algorithm will give you the same values in O(n log n) time for n segments. All you need to do is take the start and end times of the intervals and sort them, while keeping track of whether each time is a start or an end time:
from itertools import groupby
from operator import itemgetter

def start_end(st):
    for start, end in st:
        yield (start, 1)    # an interval opens
        yield (end, -1)     # an interval closes

events = sorted(start_end(st))
# Perform a line sweep to find the changes in the weights
changes = [(t, sum(delta for _, delta in group))
           for t, group in groupby(events, itemgetter(0))]
# Compute the running sum of the weights
# See question 35605014 for this part of the answer
Now, if you start from the weights themselves, you can easily convert them into a list of start and end times that are not yet coupled into intervals. All you need to do is convert the smooth spline in the post into a step function. Whenever the step function increases in value you add a sleep start time, and whenever it decreases you add a sleep end time. Finally you perform a line sweep to match each sleep start time to a sleep end time. There is a bit of wiggle room here, as you can match any start time with any end time. If you want more data points, you can introduce additional sleep start and end times, as long as they fall at the same point in time.
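As a minimal sketch of that conversion, assuming a per-minute step function with no overlapping intervals (weights of only 0 and 1), so that the i-th start simply pairs with the i-th end:
import numpy as np

# A hypothetical step function: 1 for every minute the baby is asleep.
weights = np.zeros(24 * 60, dtype=int)
for lo, hi in [(0, 363), (518, 568), (763, 900), (1185, 1440)]:
    weights[lo:hi] = 1

# Wherever the step function goes up we have a sleep start, wherever it
# goes down we have a sleep end (the day boundaries count as changes too).
changes = np.diff(weights, prepend=0, append=0)
starts = np.flatnonzero(changes > 0)
ends = np.flatnonzero(changes < 0)

# Line sweep / pairing: with non-overlapping intervals the i-th start
# simply matches the i-th end.
intervals = list(zip(starts.tolist(), ends.tolist()))
print(intervals)   # [(0, 363), (518, 568), (763, 900), (1185, 1440)]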
I have a DataFrame with the results of a marathon race, where each row represents a runner and columns include data like "Start Time" (timedelta), "Net Time" (timedelta), and Place (int). A scatter plot of start time vs net time makes it easy to visually identify the different starting corrals (heats) in the race:
I'd like to analyze each heat separately, but I can't figure out how to divide them up. There are about 20,000 runners in the race. The start time spacings are not consistent, nor is the number of runners in a given corral.
Gist of the code I'm using to organize the data:
https://gist.github.com/kellbot/1bab3ae83d7b80ee382a
CSV with about 500 results:
https://github.com/kellbot/raceresults/blob/master/Full/B.csv
There are lots of ways you can do this (including throwing scipy's k-means at it), but simple inspection makes it clear that there's at least 60 seconds between heats. So all we need to do is sort the start times, find the 60s gaps, and every time we find a gap assign a new heat number.
This can be done easily using the diff-compare-cumsum pattern:
starts = df["Start Time"].copy()
starts = starts.sort_values()   # Series.sort() is gone in recent pandas; sort_values returns a sorted copy
dt = starts.diff()
heat = (dt > pd.Timedelta(seconds=60)).cumsum()   # a new heat starts after every gap longer than 60 s
heat = heat.sort_index()
which correctly picks up the 16 (apparent) groups, here coloured by heat number:
If I understand correctly, you are asking for a way to algorithmically aggregate the Start Num values into different heats. This is a one dimensional classification/clustering problem.
A quick solution is to use one of the many Jenks natural breaks scripts. I have used drewda's version before:
https://gist.github.com/drewda/1299198
From inspection of the plot, we know there are 16 heats. So you can a priori select the number of classes to be 16.
import matplotlib.pyplot as plt
import jenks   # drewda's gist above, saved locally as jenks.py

k = jenks.getJenksBreaks(full['Start Num'].tolist(), 16)
ax = full.plot(kind='scatter', x='Start Num', y='Net Time Sec', figsize=(15, 15))
[plt.axvline(x) for x in k]
From your sample data, we see it does a pretty good job, but due to the sparsity of observations it fails to identify the break between the smallest Start Num bins:
I am having a hard time figuring out how to calculate when a satellite crosses a specific longitude. It would be nice to be able to provide a time period and a TLE and be able to return all the times at which the satellite crosses a given longitude during the specified time period. Does pyephem support something like this?
There are so many possible circumstances that users might ask about — when a satellite crosses a specific longitude; when it reaches a specific latitude; when it reaches a certain height or descends to its lowest altitude; when its velocity is greatest or least — that PyEphem does not try to provide built-in functions for all of them. Instead, it provides a newton() function that lets you find the zero-crossing of whatever comparison you want to make between a satellite attribute and a pre-determined value of that attribute that you want to search for.
Note that the SciPy Python library contains several very careful search functions that are much more sophisticated than PyEphem's newton() function, in case you are dealing with a particularly poorly-behaved function:
http://docs.scipy.org/doc/scipy/reference/optimize.html
Here is how you might search for when a satellite — in this example, the ISS — passes a particular longitude, to show the general technique. This is not the fastest possible approach — the minute-by-minute search, in particular, could be sped up if we were very careful — but it is written to be very general and very safe, in case there are other values besides longitude that you also want to search for. I have tried to add documentation and comments to explain what is going on, and why I use znorm instead of returning the simple difference. Let me know if this script works for you, and explains its approach clearly enough!
import ephem
line0 = 'ISS (ZARYA) '
line1 = '1 25544U 98067A 13110.27262069 .00008419 00000-0 14271-3 0 6447'
line2 = '2 25544 51.6474 35.7007 0010356 160.4171 304.1803 15.52381363825715'
sat = ephem.readtle(line0, line1, line2)
target_long = ephem.degrees('-83.8889')
def longitude_difference(t):
    '''Return how far the satellite is from the target longitude.

    Note carefully that this function does not simply return the
    difference of the two longitudes, since that would produce a
    terrible jagged discontinuity from 2pi to 0 when the satellite
    crosses from -180 to 180 degrees longitude, which could happen to be
    a point close to the target longitude. So after computing the
    difference in the two angles we run degrees.znorm on it, so that the
    result is smooth around the point of zero difference, and the
    discontinuity sits as far away from the target position as possible.
    '''
    sat.compute(t)
    return ephem.degrees(sat.sublong - target_long).znorm
t = ephem.date('2013/4/20')
# How did I know to make jumps by minute here? I experimented: a
# `print` statement in the loop showing the difference showed huge jumps
# when looping by a day or hour at a time, but minute-by-minute results
# were small enough steps to bring the satellite gradually closer to the
# target longitude at a rate slow enough that we could stop near it.
#
# The direction that the ISS travels makes the longitude difference
# increase with time; `print` statements at one-minute increments show a
# series like this:
#
# -25:16:40.9
# -19:47:17.3
# -14:03:34.0
# -8:09:21.0
# -2:09:27.0
# 3:50:44.9
# 9:45:50.0
# 15:30:54.7
#
# So the first `while` loop detects if we are in the rising, positive
# region of this negative-positive pattern and skips the positive
# region, since if the difference is positive then the ISS has already
# passed the target longitude and is on its way around the rest of
# the planet.
d = longitude_difference(t)
while d > 0:
    t += ephem.minute
    sat.compute(t)
    d = longitude_difference(t)
# We now know that we are on the negative-valued portion of the cycle,
# and that the ISS is closing in on our longitude. So we keep going
# only as long as the difference is negative, since once it jumps to
# positive the ISS has passed the target longitude, as in the sample
# data series above when the difference goes from -2:09:27.0 to
# 3:50:44.9.
while d < 0:
    t += ephem.minute
    sat.compute(t)
    d = longitude_difference(t)
# We are now sitting at a point in time when the ISS has just passed the
# target longitude. The znorm of the longitude difference ought to be a
# gently sloping zero-crossing curve in this region, so it should be
# safe to set Newton's method to work on it!
tn = ephem.newton(longitude_difference, t - ephem.minute, t)
# This should be the answer! So we print it, and also double-check
# ourselves by printing the longitude to see how closely it matches.
print('When did ISS cross this longitude?', target_long)
print('At this specific date and time:', ephem.date(tn))
sat.compute(tn)
print('To double-check, at that time, sublong =', sat.sublong)
The output that I get when running this script suggests that it has indeed found the moment (within reasonable tolerance) when the ISS reaches the target longitude:
When did ISS cross this longitude? -83:53:20.0
At this specific date and time: 2013/4/20 00:18:21
To double-check, at that time, sublong = -83:53:20.1
There is a time difference between the time at which the program calculates the pass over the longitude and the real time. I've checked it against NASA's LIS system (which is aboard the ISS) for detecting lightning.
I have discovered that over Europe, on some orbits, the pass time the program calculates is 30 seconds ahead of the real time. Over Colombia, on some orbits, it is about 3 minutes ahead (perhaps because 1 degree of longitude spans more kilometres in Colombia than in Europe). But this problem only happens on 2 particular orbits: the one that passes over France and goes down over Sicily, and the one that passes over the USA and goes down over Cuba.
Why could this happen?
In my opinion, maybe there is some mistake in the ephem.newton algorithm, or in the TLE (it normally reads the one created at 00:00:00, when the day changes, rather than the most recent one, even though 3-4 TLEs per day are published for the ISS), or maybe the sat.sublong function calculates a wrong nadir for the satellite.
Does anyone have an idea or an explanation for this problem? Why does it happen?
PS: I need to check this carefully because I need to know when the ISS crosses an area (in order to detect the lightning inside that area). If, on some orbits, the time the program calculates is ahead of the real time, then the sat.sublong function calculates that the ISS is outside the area (that it hasn't arrived at the area yet) when it is really inside the area. So on some occasions the real time doesn't match the one the program calculates.
Thanks a lot for your time !
I'm trying to compute item-to-item similarity along the lines of Amazon's "Customers who viewed/purchased X have also viewed/purchased Y and Z". All of the examples and references I've seen are for computing item similarity for ranked items, finding user-user similarity, or finding recommended items based on the current user's history. I'd like to start with a non-targeted approach before factoring in the current user's preferences.
Looking at the Amazon.com recommendations white paper, they use the following logic for offline item-item similarity:
For each item in product catalog, I1
    For each customer C who purchased I1
        For each item I2 purchased by customer C
            Record that a customer purchased I1 and I2
    For each item I2
        Compute the similarity between I1 and I2
If I understand correctly, by the time we're at "Compute the similarity between I1 and I2", I have a list of items (I2) purchased in conjunction with a single item I1 (the outer loop).
How is this calculation performed?
Another idea is that I'm overthinking this and making it more difficult than I need to - Would it be enough to do a top-n query on the count of I2 bought in conjunction with I1?
I would also appreciate suggestions on whether or not this approach is a correct one. My product database has about 150k items at any time. Since the bulk of the reading material I've seen covers user-item similarity or even user-user similarity, should I be looking to go that route instead?
I've worked with similarity algorithms in the past but they've always involved a rank or a score. I think the only way this would work would be to build a customer-product matrix scoring 0/1 for not purchased/purchased. Given the purchase history and the item size, this could get really large.
Edit: although I listed Python as a tag, I'd prefer to keep the logic inside the DB, preferably using Oracle PL/SQL.
Let's understand Item-to-Item Collaborative Filtering.
Suppose we have a purchase matrix:
        Item1  Item2  ...  ItemN
User1     0      1    ...    0
User2     1      1    ...    0
  .
  .
  .
UserM     1      0    ...    0
Then we can calculate item similarity using the column vectors, e.g. with cosine similarity. We get a symmetric item-similarity matrix as below:
        Item1  Item2  ...  ItemN
Item1     1     1/M   ...    0
Item2    1/M     1    ...    0
  .
  .
  .
ItemN     0      0    ...    1
It can be explained as "Customers who viewed/purchased X have also viewed/purchased Y, Z, ..." (collaborative filtering), because each item's vector is built from the users who purchased it.
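As a small sketch of that computation on a made-up purchase matrix, using NumPy and cosine similarity between the item columns:
import numpy as np

# Hypothetical user x item purchase matrix (rows: users, columns: items).
purchases = np.array([[0, 1, 1, 0],
                      [1, 1, 0, 0],
                      [1, 0, 1, 1],
                      [0, 1, 1, 0]], dtype=float)

# Cosine similarity between the item column vectors.
norms = np.linalg.norm(purchases, axis=0)
similarity = (purchases.T @ purchases) / np.outer(norms, norms)

# Items most similar to item 0, best first (excluding item 0 itself).
ranked = np.argsort(-similarity[0])
print([int(i) for i in ranked if i != 0])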
Amazon's logic is exactly the same as the above, but its goal is to improve efficiency. As they said:
We could build a product-to-product matrix by iterating through all item pairs and computing a similarity metric for each pair. However, many product pairs have no common customers, and thus the approach is inefficient in terms of processing time and memory usage. The iterative algorithm provides a better approach by calculating the similarity between a single product and all related products.
There's a good O'Reilly book on this topic. While the whitepaper might lay the logic out in pseudo-code like that, I don't think that approach would scale very well. The calculations are all probability calculations, so things like Bayes' Theorem get used to say, "Given Person A purchased X, what's the likelihood they purchased Z?" Straightforward looping over the data is working too hard. You have to go through it all for each person.
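For what it's worth, that "given Person A purchased X, what's the likelihood they purchased Z" estimate reduces to a conditional frequency over the purchase history; here is a tiny sketch with hypothetical data:
# Hypothetical purchase history: customer -> set of items bought.
purchases = {
    'A': {'X', 'Z'},
    'B': {'X'},
    'C': {'X', 'Z', 'Y'},
    'D': {'Y'},
}

bought_x = [items for items in purchases.values() if 'X' in items]
p_z_given_x = sum('Z' in items for items in bought_x) / len(bought_x)
print(p_z_given_x)   # 2 of the 3 X-buyers also bought Z -> 0.666...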
@Neil, or whoever comes to this question later on:
The choice of similarity metric is up to you and you might want to leave it malleable for the future. Check out the Wikipedia article on the Frobenius norm for a start. Or, as in the link you submitted, the Jaccard coefficient or the cosine similarity cos(I1, I2).
User-item –vs– user-user –vs– item-item, or whatever combination, cannot be answered objectively. It depends on what kind of data you can get from your users, how the UI draws information out of them, what parts of your data you consider reliable, and your own time constraints (as far as hybrids go).
Since many people have done masters theses on the questions above, you probably want to start with the easiest implementable solution while leaving room for growth in the complexity of the algorithm.
This may not be a perfect answer to your question, but another way to look at this problem is frequent itemset mining, which computes all the frequently co-purchased product pairs / groups given a minimum frequency threshold. You can then map a customer's purchase to its commonly co-purchased products.
There is no model training or Bayesian probability prediction involved because it's a pure counting problem: you just need to count the frequency of all possible product pairs purchased together in your transaction base. It's an exponential search space, but there are a lot of efficient algorithms and implementations out there (SPMF is a very good one, written in Java). This could work as a quick baseline model.
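As a tiny baseline along those lines, here is a sketch in plain Python with hypothetical transactions, counting only pairs rather than arbitrary itemsets:
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each one is the set of items bought together.
transactions = [
    {'bread', 'milk'},
    {'bread', 'milk', 'eggs'},
    {'milk', 'eggs'},
    {'bread', 'milk'},
]
min_support = 2   # keep only pairs co-purchased at least this many times

pair_counts = Counter()
for basket in transactions:
    pair_counts.update(combinations(sorted(basket), 2))

frequent_pairs = {pair: n for pair, n in pair_counts.items() if n >= min_support}
print(frequent_pairs)   # {('bread', 'milk'): 3, ('eggs', 'milk'): 2}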