Detecting overlapping date recurrence rules - python

I'm working on an application that looks like Google Calendar, but with one main difference: events must not intersect other events. No two events may share any time, even at minute granularity. This is especially useful for a calendar that only stores meetings, since nobody can be in two meetings at the same time.
Just like in Google Calendar, events may be created using recurrence rules (every Friday and Sunday from 10 AM to 1 PM, for example). I would like to detect overlapping events using only rrules (from the python-dateutil module), without creating N datetime objects and checking each one for intersection.
Is it possible to detect overlapping dates by only using rrules? Is there anything similar already implemented in another library?

No, I don't believe it's possible to analyse a rrule to see if it can intersect another one without creating the datetime objects.
Essentially you're asking for the output of an algorithm without running the algorithm, and I think that's non-computable.
However, for certain types of rrule it is possible, e.g. a rrule for every Thursday can't intersect a rrule for every Tuesday. The problematic cases are days of the month and days of the year intersecting with days of the week, and frequencies that never coincide.
The best bet would be to handle the analytically checkable rules analytically, then for the others generate the occurrences for the next year or so and compare them directly.
The algorithm can run fast, since you can cache the existing occupied times as you add each rule.
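A minimal sketch of the "generate and compare" fallback, using only the standard library rather than python-dateutil. The helper names (`weekly_occurrences`, `find_overlaps`), the simplified weekly rule shape and the 28-day horizon are assumptions for illustration; with dateutil, the expansion step would come from iterating the rrule itself (for instance over a bounded window with its `between` method).

```python
from datetime import date, datetime, time, timedelta

def weekly_occurrences(weekdays, start_time, duration, first_day, horizon_days):
    """Expand a simplified weekly rule into (start, end) intervals.

    weekdays uses date.weekday() numbering (Monday == 0).
    """
    intervals = []
    for offset in range(horizon_days):
        day = first_day + timedelta(days=offset)
        if day.weekday() in weekdays:
            start = datetime.combine(day, start_time)
            intervals.append((start, start + duration))
    return intervals

def find_overlaps(intervals):
    """Return (earlier, later) pairs that share time.

    Intervals are sorted by start and each one is compared with the
    interval that ends latest among those seen so far, so every
    interval that collides with an earlier one is reported once.
    """
    clashes = []
    latest = None
    for current in sorted(intervals):
        if latest is not None and current[0] < latest[1]:
            clashes.append((latest, current))
        if latest is None or current[1] > latest[1]:
            latest = current
    return clashes

# Hypothetical rules: the first mirrors the example in the question.
first_day = date(2024, 1, 1)  # a Monday
fri_sun = weekly_occurrences({4, 6}, time(10, 0), timedelta(hours=3), first_day, 28)
fri_noon = weekly_occurrences({4}, time(12, 0), timedelta(hours=1), first_day, 28)
tuesday = weekly_occurrences({1}, time(10, 0), timedelta(hours=3), first_day, 28)

friday_clashes = find_overlaps(fri_sun + fri_noon)
tuesday_clashes = find_overlaps(fri_sun + tuesday)
```

Keeping the already accepted intervals sorted is one way to get the caching effect mentioned above: a new rule only needs to be expanded and compared against the stored list.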

Related

How to analyze numerical and categorical variables at the same time?

I'm trying to analyze the data of a food ordering application.
The data consist of both numerical and categorical variables. The main variable I'm studying is the total delivery time of an order, which represents the time from placing the order to closing it. I want to find out which variables affect it the most.
Example rows from the data:
| order id | branch id | date | time placed | day | period | items id | no. items | total no. items | total delivery time | total time in seconds |
|---|---|---|---|---|---|---|---|---|---|---|
| 113113 | 31 | 2/2/2021 | 13:32:24 | Tuesday | afternoon | 571 | 4 | 11 | 00:46:19 | 2805 |
| 113113 | 31 | 2/2/2021 | 13:32:24 | Tuesday | afternoon | 573 | 4 | 11 | 00:46:19 | 2805 |
I want to study the effects of all the variables on the total time, including items id and branch id. Does a certain item affect the time? Do the day and the period of the day affect it as well?
I used linear regression to get the correlation between total time and the numerical variables, and tried one-way ANOVA for some categorical variables, but I didn't like the results. Is there a way to analyze all variables together without encoding the categorical ones?
I'm looking forward to seeing what other people say about this. Here's my two cents.
ML algorithms like regression love numbers. ML algorithms like classification love labels (non-numbers). You can certainly convert labeled data to numbered data, but it can backfire. Coding ['red','green','blue'] as [1,2,3] produces weird results: 'red' becomes lower than 'blue', and averaging a 'red' and a 'blue' gives a 'green'. A more subtle example is coding ['low', 'medium', 'high'] as [1,2,3]. Here the ordering makes sense, but subtle inconsistencies can arise when 'medium' is not exactly midway between 'low' and 'high'. Now, under the hood, I think classifiers convert labels to numbers, so if you feed in large, medium, and small, the analysis isn't using those labels directly; it's converting the categories to numbers. I think. Maybe someone can confirm this for me.
Thus, I don't think it makes sense to try to measure any kind of relationship between IDs and specific outcomes, like 'totaltime', 'totaldays', etc. If you kick off a project on a Monday or a Friday, does the project end sooner or later than non-Monday-start or non-Friday-start projects? Well, maybe it does. But, is that correlation or causation? You can find correlations between all kinds of things, but these don't necessarily imply causation between these same things. Let's say you find a strong relationship between multiple projects that start on the second Monday of the month and all of these projects get finished off much faster than all other projects. This seems like pure coincidence, rather than causation. Or, there is some other factor impacting the outcome. Maybe projects that start on the second Monday of the month are typically small upgrades, rather than full-blown new undertakings, so the volume of work is less, and the project is done faster. However, starting the work on the second Monday of the month doesn't CAUSE the project to be finished off faster. Tell me if I am wrong. I'm always open to feedback.
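The colour-coding pitfall described above can be shown in a few lines. This is only an illustrative sketch in plain Python: `one_hot` is a hypothetical helper, not a library function, and one-hot coding is named here as a common alternative that the answer itself does not mention.

```python
colours = ["red", "green", "blue"]

# Ordinal coding: assigns 1, 2, 3 and thereby imposes an order and a
# spacing that the labels themselves do not have.
ordinal = {name: index + 1 for index, name in enumerate(colours)}

# Averaging the codes for 'red' and 'blue' lands exactly on 'green'.
mean_red_blue = (ordinal["red"] + ordinal["blue"]) / 2

def one_hot(value, categories):
    """Hypothetical helper: one 0/1 indicator per category, no implied order."""
    return [1 if value == category else 0 for category in categories]

encoded_green = one_hot("green", colours)
```

With indicator columns, no category is "between" two others, which avoids the averaging artefact at the cost of one column per label.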

Clustering and Distance/Dissimilarity Matrix based on String/Integer sequences in Python

I have data on each customer's stay in a shop. The shop has 4 zones: zone 1, 2, 3 and 4. Every 2 minutes I get a reading of which zone the customer is in, so each customer is described by a sequence of 10 numbers. For example:
1-1-1-1-1-1-1-1-3-3-2
4-4-3-3-3-3-3-2-1-3-3
3-4-1-2-2-3-1-4-2-1-4
Basically, I expect that there are customers who mostly are in a particular zone and they are clustered accordingly. So, in the first sequence, the customer seems to prefer zone 1, the next one zone 3 and the last one is like noise.
All I am feeding to the program is a bunch of sequences (unlabeled). How do I generate a distance/dissimilarity matrix that calculates the distances between each sequence in Python?
After a little bit of digging, I came across the textdistance library in python.
https://pypi.org/project/textdistance/
It seems to be working well for this problem, even though my input is a sequence of integers.
You can use cosine or euclidean distances to calculate the distance.
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.cosine.html
https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.euclidean_distances.html
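A rough sketch of the cosine-distance idea in plain Python, without SciPy or scikit-learn. It rests on one assumption that is not in the answers above: each sequence is first reduced to a vector of visit counts per zone, so that customers who favour the same zone end up close together regardless of the order of their readings. The function names are hypothetical.

```python
from collections import Counter
from math import sqrt

ZONES = (1, 2, 3, 4)

def zone_profile(sequence):
    """Turn '1-1-3-2' into a list of visit counts, one entry per zone."""
    counts = Counter(int(zone) for zone in sequence.split("-"))
    return [counts[zone] for zone in ZONES]

def cosine_distance(u, v):
    """1 - cosine similarity; assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def distance_matrix(sequences):
    """Pairwise cosine distances between the zone profiles of all sequences."""
    profiles = [zone_profile(sequence) for sequence in sequences]
    return [[cosine_distance(p, q) for q in profiles] for p in profiles]

# The three example sequences from the question.
sequences = [
    "1-1-1-1-1-1-1-1-3-3-2",
    "4-4-3-3-3-3-3-2-1-3-3",
    "3-4-1-2-2-3-1-4-2-1-4",
]
matrix = distance_matrix(sequences)
```

A matrix like this can be passed to any clustering routine that accepts precomputed distances. If the order of the readings matters, an edit-style distance on the raw sequences would be a better fit than count vectors.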

Optimization of a scheduling problem in Python

I am trying to learn on scheduling and have the following use case:
I have different parts that need to be delivered on specific dates; they also have different quantities and different runtimes. Only 1 machine is considered. The delivery date is a hard constraint, but I would also like to see if I can optimize the setup of the machine for each product. For that I have a table with the different tools used for the parts. When a cell has a 0 the tool is not used; when it has a 1 the tool is used. I have around 50 tools in total for all parts. I do not want to look only at the delivery dates; I also want to see how I can shorten the time for the change from part A to part B, so that I change as few tools as possible.
I was able to sort my data by date, but I do not know where to start with the optimization or which algorithm might be good: a genetic algorithm or ant colony optimization? I cannot provide code yet and do not want complete code from here either; I am interested in a good starting point.
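As one possible starting point, the quantity to minimise can be written down before choosing an algorithm. The sketch below assumes that the changeover cost between two parts is simply the number of tools whose 0/1 entry differs between their rows; the table contents and function names are made up for illustration, and delivery dates are not modelled here.

```python
def changeover(tools_a, tools_b):
    """Number of tools that must be mounted or removed between two parts."""
    return sum(a != b for a, b in zip(tools_a, tools_b))

def total_changeovers(order, tool_table):
    """Sum of changeover costs along a production order."""
    return sum(
        changeover(tool_table[current], tool_table[following])
        for current, following in zip(order, order[1:])
    )

# Hypothetical tool table: one 0/1 row per part, one column per tool.
tool_table = {
    "A": [1, 1, 0, 0],
    "B": [1, 0, 1, 0],
    "C": [1, 1, 0, 1],
}

cost_abc = total_changeovers(["A", "B", "C"], tool_table)
cost_bac = total_changeovers(["B", "A", "C"], tool_table)
```

Any search method (greedy ordering, a genetic algorithm, ant colony optimization) could then use a function like `total_changeovers` as its objective, restricted to orders that still meet the delivery dates.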

Best graph type for clusters of web requests in time period

I'm writing a short Python script to try and visualise some of our Apache logs using matplotlib, to get an idea of the kinds of requests being made, and the users doing so.
Parsing the logs into a DB format for easy querying was simple enough, however I'm currently wondering what the best kind of graph to use would be if I'm looking for clusters of data - if say, one user is performing a lot of requests one after another with different timestamps, for example, this may present a fairly constant, but low line on a line graph or scatter diagram, but I'd like to make it more visually obvious that the user is making regular requests during a period of time.
If it was a pure count of the number of hits a user is making, it wouldn't be an issue, as a bar graph would suffice, but I'm at a loss as to how I can relate those hits around a time period, without specifying ranges of time periods in my initial query.
Anyone unfamiliar with the graph types matplotlib/pyplot offers can see a range of them here: http://matplotlib.org/gallery.html
Suggestions from any of the data visualisation veterans out there are most appreciated!
You can use bubbles to show how many hits each user made in each interval of your timeline. A bigger bubble means more hits. Rank your users by total hits so the most active ones appear first. It's kind of like a bar chart, but you use bubbles to indicate counts.
Something like this:
http://neuralengr.com/asifr/journals/
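The data preparation behind such a bubble chart could look roughly like this. It is a sketch under assumptions: the log has already been parsed into `(user, datetime)` pairs, the 5-minute bucket width is arbitrary, and `bucket_hits` is a hypothetical helper. Only the counting step is shown; the resulting counts could then be used as marker sizes (the `s` argument) in matplotlib's `scatter`.

```python
from collections import Counter
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def bucket_hits(hits, bucket_seconds=300):
    """Count hits per (user, bucket start) for fixed-width time buckets."""
    counts = Counter()
    for user, stamp in hits:
        seconds = int((stamp - EPOCH).total_seconds())
        # Round the timestamp down to the start of its bucket.
        bucket_start = EPOCH + timedelta(seconds=seconds - seconds % bucket_seconds)
        counts[(user, bucket_start)] += 1
    return counts

# Hypothetical parsed log entries.
hits = [
    ("alice", datetime(2024, 1, 1, 10, 0, 5)),
    ("alice", datetime(2024, 1, 1, 10, 1, 0)),
    ("alice", datetime(2024, 1, 1, 10, 4, 59)),
    ("alice", datetime(2024, 1, 1, 10, 5, 0)),
    ("bob", datetime(2024, 1, 1, 10, 2, 0)),
]
counts = bucket_hits(hits)

# Total hits per user, most active first, to order the rows of the chart.
totals = Counter()
for (user, _bucket), n in counts.items():
    totals[user] += n
ranking = [user for user, _ in totals.most_common()]
```

Each `(user, bucket start)` key gives one bubble: the bucket start is the x position, the user's rank is the y position, and the count drives the size.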

Scheduling: Minimizing Gaps between Non-Overlapping Time Ranges

I am using Django to develop a small scheduling web application where people are assigned certain times to meet with their superiors. Employees are stored as models, with a OneToMany relation to a model representing the time ranges and days of the week when they are free. For instance:
Bob: (W 9:00, 9:15), (W 9:15, 9:30), ... (W 15:00, 15:20)
Sarah: (Th 9:05, 9:20), (F 9:20, 9:30), ... (Th 16:00, 16:05)
...
Mary: (W 8:55, 9:00), (F 13:00, 13:35), ... etc
My program allows a basic schedule setup, where employers can choose to view the first N possible schedules with the least gaps in between meetings under the condition that they meet all their employees at least once during that week. I am currently generating all possible permutations of meetings, and filtering out schedules where there are overlaps in meeting times. Is there a way to generate the first N schedules out of M possible ones, without going through all M possibilities?
Clarification: We are trying to get the minimum sum of gaps for any given day, summed over all days.
I would use a search algorithm, like A-star, to do this. Each node in the graph represents a person's available time slots and a path from one node to another means that node_a and node_b are in the partial schedule.
Another solution would be to create a graph in which the nodes are each person's availability times and there is an edge from node_a to node_b if the person associated with node_a is not the same as the person associated with node_b. The weight of each edge is the amount of time between the times associated with the two nodes.
After creating this graph, you could generate a variant of a minimum spanning tree from the graph. The variant would differ from MSTs in that:
- you only add a node to the MST if the person associated with the node is not already in the MST;
- you finish creating the MST when all persons are in the MST.
The minimum spanning tree generated would represent a single schedule.
To generate other schedules, remove all the edges from the graph which are found in the schedule you just created and then create a new minimum spanning tree from the graph with the removed edges.
In general, scheduling problems are NP-hard, and while I can't figure out a reduction for this problem to prove it, it's quite similar to a number of other well-known NP-complete problems. There may be a polynomial-time solution for finding the minimum gap for a single day (though I don't know it offhand, either), but I have less hope when it has to be solved for multiple days. Unfortunately, it's a complicated problem, and there may not be a perfectly elegant answer. (Or, I'm going to kick myself when someone posts one later.)
First off, I'd say that if your dataset is reasonably small and you've been able to compute all possible schedules fairly quickly, you may just want to stick with that solution, as all others will be approximations, and could possibly end up running slower, if the constant factor of their running time is large. (Meaning that it doesn't grow with the size of the dataset, so it will relatively be smaller for a large dataset.)
The simplest approximation would be to just use a greedy heuristic. It will almost assuredly not find the optimum schedules, and may take a long time to find a solution if most of the schedules are overlapping, and there are only a few that are even valid solutions - but I'm going to assume that this is not the case for employee times.
Start with an arbitrary schedule, choosing one timeslot for each employee at random. For each iteration, pick one employee and change his timeslot to the best possible time with respect to the rest of the current schedule. Repeat this process until you're satisfied with the result: when it isn't improving quickly enough anymore or has taken too long. You're probably not going to want to repeat until you can't make any more changes that improve the schedule, since this process will likely loop for most data.
It's not a great heuristic, but it should produce some reasonable schedules, and it has a lot of room for adjustment. You may want to always try to switch overlapping times first before any others, or you may want to try to flip the employee who currently contributes to the largest gap, or maybe eliminate certain slots that you've already tried. You may want to sometimes allow a move to a less optimal solution in the hope of escaping a local minimum; some randomness can also help with this. Make sure you always keep track of the best solution you've seen so far.
To generate more schedules, the most obvious thing would be to just start the process over with a different random schedule. Or, maybe flip a few arbitrary times from the previous solution you found, and repeat from there.
Edit: This is all fairly related to genetic algorithms, and you may want to use some of the ideas I presented here in a GA.
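The greedy heuristic described above can be sketched roughly as follows. This is an illustration under several assumptions, not a tuned implementation: slots are `(day, start_minute, end_minute)` tuples, overlaps are discouraged with an arbitrary large penalty rather than forbidden, employees are visited round-robin instead of being picked at random, and the availability data is a small subset loosely based on the example in the question.

```python
import random

OVERLAP_PENALTY = 10_000  # arbitrary large cost for any pair of clashing meetings

def cost(schedule):
    """Sum of gaps between consecutive meetings per day, plus overlap penalties."""
    by_day = {}
    for day, start, end in schedule.values():
        by_day.setdefault(day, []).append((start, end))
    total = 0
    for meetings in by_day.values():
        meetings.sort()
        for (_, end_a), (start_b, _) in zip(meetings, meetings[1:]):
            total += OVERLAP_PENALTY if start_b < end_a else start_b - end_a
    return total

def improve(availability, sweeps=5, seed=0):
    """Random start, then repeatedly move one employee to their best slot."""
    rng = random.Random(seed)
    current = {person: rng.choice(slots) for person, slots in availability.items()}
    best, best_cost = dict(current), cost(current)
    for _ in range(sweeps):
        for person, slots in availability.items():
            # Best slot for this person with everyone else held fixed.
            current[person] = min(
                slots, key=lambda slot: cost({**current, person: slot})
            )
            current_cost = cost(current)
            if current_cost < best_cost:  # keep the best schedule seen so far
                best, best_cost = dict(current), current_cost
    return best, best_cost

# Times in minutes since midnight, e.g. 540 == 9:00.
availability = {
    "Bob": [("W", 540, 555), ("W", 900, 920)],
    "Sarah": [("Th", 545, 560), ("F", 560, 570)],
    "Mary": [("W", 535, 540), ("F", 780, 815)],
}
best, best_cost = improve(availability)
```

Calling `improve` again with a different `seed` gives a different random start, which is the simplest way to collect several candidate schedules as suggested above.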
