I've got a sqlite database of music tracks, and I want to remove duplicates. I'd like to compare tracks based on title and duration. (I'll probably try to throw artists in later, but that's a separate table with multiple artists per track; for now, I've got a text field for the title and an integer field for the duration in seconds.) Duplicate tracks in this database tend to have similar titles (or at least similar prefixes) and durations within 5-10 seconds of each other.
I'm trying to learn recordlinkage to detect the duplicates. My first attempt was to make a full index, use Smith-Waterman to compare titles and a simple linear numeric comparison for the duration. No big surprise; the database was WAY too big to do a full index. I could do a block index on the duration to limit the pairs to identical durations, but the durations are often off by a few seconds. I could do sorted neighborhood, but if* I'm understanding this correctly (*a big "if"), that means that if I set a window to (for example) 10, each track will only be paired with the 10 closest tracks in terms of duration, which will pretty much always be identical durations and completely miss the durations that are close but not identical. It seems to me like having an "approximate blocking index" or something like that would be a natural step, but I can't seem to find any simple way to do that.
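For reference, my first attempt looked roughly like this (a simplified sketch; tracks is a hypothetical pandas DataFrame loaded from the sqlite table, and the smith_waterman string method assumes a recordlinkage version that includes it):

import recordlinkage

indexer = recordlinkage.Index()
indexer.full()                      # every possible pair -- this is what blew up
pairs = indexer.index(tracks)

compare = recordlinkage.Compare()
compare.string('title', 'title', method='smith_waterman', label='title')
compare.numeric('duration', 'duration', method='linear', offset=0, scale=10, label='duration')
features = compare.compute(pairs, tracks)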
Can anyone help me out here?
Okay, answering my own question here because I believe I've figured out the misunderstanding in my original question.
I was misunderstanding how sorted neighborhood indexing works. I was thinking that if you set the window to (for example) 3, it would sort all the records by the key and then pair each record with exactly 3 neighbor records (the record itself, the one above it, and the one below it). So if there were more than 5 records with the same key value, this would actually result in fewer pairs than a block index. But I'm now pretty sure that it's actually grouping the values by the key first, so that a window of 3 will pair each record with all the records that have the exact same key value, all the records with the next-highest key value, and all the records with the next-lowest key value.
Now this doesn't get me exactly what I asked for, but it gets me close enough. If I set a window size of 11 (or 21), then I'll be guaranteed to get all values within 5 seconds (or 10 seconds). If the data is sparse with respect to duration, the window will reach a bit further than that, but a few extra pairs are harmless. (And this only works because it's integer data. If it were floating point numbers of arbitrary precision, then that would be a different matter.)
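So in recordlinkage terms, the fix is just to swap the indexer (same hypothetical tracks DataFrame as above; note the library's British spelling of the method name):

import recordlinkage

indexer = recordlinkage.Index()
# window=11: each record is grouped with records sharing its duration value
# plus the 5 nearest duration values on either side -- with integer seconds,
# that covers everything within 5 seconds
indexer.sortedneighbourhood('duration', window=11)
pairs = indexer.index(tracks)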
I have an Excel file where the start and end times of a particular user are given, and I have to work out the difference using Python. My doubt is that the start and end times are given one below the other.
I am confused about how to separate the start and end times of the user and then calculate the difference.
snap of the data
What steps should I take, or what logic should I use?
This video should help, along with the functionality of your code here.
For the logic, you could use a for loop to go through each row using for i in range(sheet.nrows) or something close to that, and add one to a counter every time the loop runs. If the counter is odd, save the value to a start-time list or dict, and if it's even, save the value to an end-time list or dict.
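A minimal sketch of that, assuming xlrd (implied by sheet.nrows) and that the times alternate down column 0 of the first sheet; the file name and column index are made up, so adjust them to match the actual spreadsheet:

import xlrd

book = xlrd.open_workbook('times.xls')
sheet = book.sheet_by_index(0)

starts, ends = [], []
counter = 0
for i in range(sheet.nrows):
    counter += 1
    value = sheet.cell_value(i, 0)
    if counter % 2 == 1:      # odd -> start time
        starts.append(value)
    else:                     # even -> end time
        ends.append(value)

# Excel stores times as fractions of a day, so multiply the difference by 24 for hours.
differences = [(end - start) * 24 for start, end in zip(starts, ends)]
print(differences)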
I have a list of objects that each contain: Product, Price, Time. It basically holds product price change over time. So, for each price change you'll get a record with the product name, the new price and the exact second of the change.
I'm sending the data to graphite using the timestamp (written in python):
import socket
sock = socket.socket()
sock.connect(('my-host-ip', 2003))
# Graphite plaintext protocol: "<metric path> <value> <timestamp>"
message = 'my.metric.prefix.%s %s %d\n' % (product, price, timestamp)
sock.sendall(message.encode())  # encode to bytes (needed on Python 3)
sock.close()
Thing is, as prices do not change very often, the data points are very sparse, which means I get one point per product every few hours or days. If I look at Graphite at the exact time of a price change, I can see the data point. But if I want to look at price changes over time, I would like to draw a constant line from the data point of the price change going forward.
I tried using:
keepLastValue(my.metric.prefix.*)
It would work only if I look at the data points in a time frame of a few minutes, but not hours (and surely not days). Is there a way to do something like that in Graphite? Or do I have to put some redundant data every minute to describe the missing points?
I believe keepLastValue doesn't work for you at coarser time intervals due to the aggregation rules defined in storage-aggregation.conf. You can try using xFilesFactor = 0 and aggregationMethod = last to always get the last value of the metric at each aggregated point.
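For example, a stanza along these lines in storage-aggregation.conf (the section name and pattern are just guesses at your metric naming) keeps the last received value at every rollup level; note it only affects whisper files created after the change:

[price_metrics]
pattern = ^my\.metric\.prefix\.
xFilesFactor = 0
aggregationMethod = last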
However I think your concrete use case is much better resolved by using StatsD gauges. Basically you can set an arbitrary numerical value for a gauge in StatsD and it will send (flush) its value to Graphite every 10 seconds by default. You can set the flush interval to a shorter period, like 1 second, if you really need to record the second of the change. If the gauge is not updated at the next flush, StatsD will send the previous value.
So basically StatsD gauges do what you say about sending redundant data to describe missing points.
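With the statsd Python client, for instance (assuming the daemon listens on the default localhost:8125), each price change becomes a gauge update and StatsD keeps re-flushing the last value every interval:

import statsd

client = statsd.StatsClient('localhost', 8125)   # default StatsD UDP port
client.gauge('my.metric.prefix.%s' % product, price)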
I had the same problem as well with sparse data.
I used the whisper database tools outlined in the link below to update my whisper files, which were aggregating load data on 10-minute intervals.
https://github.com/graphite-project/whisper
First I examined my file using whisper-info:
/opt/graphite/bin/whisper-info.py /opt/graphite/storage/whisper/test/system/myserver/n001/loadave.wsp
Then I fixed the aggregation method using whisper-resize:
/opt/graphite/bin/whisper-resize.py --xFilesFactor=0 --aggregationMethod=last /opt/graphite/storage/whisper/test/system/myserver/n001/loadave.wsp 600:259200
Please be cautious using whisper-resize as it can result in data loss if you aren't careful!
I have a live feed of logging data coming in through the network. I need to calculate live statistics, like the one in my previous question. How would I design this module? I mean, it seems unrealistic (read, bad design) to keep applying a groupby function to the entire df every single time a message arrives. Can I just update one row and its calculated column gets auto-updated?
JFYI, I'd be running another thread that reads values from the df and prints them to a webpage every 5 seconds or so.
Of course, I could run groupby-apply every 5 seconds instead of doing it in real time, but I thought it'd be better to keep the df and the calculation independent of the printing module.
Thoughts?
groupby is pretty damn fast, and if you preallocate slots for new items you can make it even faster. In other words, try it and measure it for a reasonable amount of fake data. If it's fast enough, use pandas and move on. You can always rewrite it later.
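A minimal sketch of "try it and measure it", with made-up column names (user, value) standing in for whatever the log messages actually contain:

import time
import numpy as np
import pandas as pd

# Fake feed: 1M rows, 1k distinct keys
df = pd.DataFrame({
    'user': np.random.randint(0, 1000, size=1_000_000),
    'value': np.random.rand(1_000_000),
})

start = time.perf_counter()
stats = df.groupby('user')['value'].agg(['mean', 'count'])
print('groupby over 1M rows took %.3f s' % (time.perf_counter() - start))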
Given a time (eg. currently 4:24pm on Tuesday), I'd like to be able to select all businesses that are currently open out of a set of businesses.
I have the open and close times for every business for every day of the week
Let's assume a business can open/close only on 00, 15, 30, 45 minute marks of each hour
I'm assuming the same schedule each week.
I am most interested in being able to quickly look up a set of businesses that is open at a certain time, not the space requirements of the data.
Mind you, some may open at 11pm one day and close at 1am the next day.
Holidays don't matter - I will handle these separately
What's the most efficient way to store these open/close times such that with a single time/day-of-week tuple I can speedily figure out which businesses are open?
I am using Python, SOLR and mysql. I'd like to be able to do the querying in SOLR. But frankly, I'm open to any suggestions and alternatives.
If you are willing to just look at a single week at a time, you can canonicalize all opening/closing times to be a number of minutes since the start of the week, say Sunday 0 hrs. For each store, you create a number of tuples of the form [startTime, endTime, storeId]. (For hours that span Sunday midnight, you'd have to create two tuples, one going to the end of the week, one starting at the beginning of the week.) This set of tuples would be indexed (say, with a tree you would pre-process) on both startTime and endTime. The tuples shouldn't be that large: there are only ~10k minutes in a week, which fits in 2 bytes. This structure would sit gracefully inside a MySQL table with appropriate indexes, and would be very resilient to constant insertions and deletions of records as information changes.

Your query would simply be "select storeId where startTime <= time and endTime >= time", where time is the canonicalized minutes since midnight on Sunday.
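A sketch of that canonicalization (day numbering assumes Sunday = 0; the store_hours table and column names are just the ones used in the query above):

def minutes_since_week_start(day_of_week, hour, minute):
    """day_of_week: 0 = Sunday ... 6 = Saturday."""
    return day_of_week * 24 * 60 + hour * 60 + minute

def hours_to_rows(store_id, open_dow, open_h, open_m, close_dow, close_h, close_m):
    """Return one or two (startTime, endTime, storeId) rows, splitting spans
    that wrap past the Saturday-to-Sunday midnight boundary."""
    start = minutes_since_week_start(open_dow, open_h, open_m)
    end = minutes_since_week_start(close_dow, close_h, close_m)
    if start <= end:
        return [(start, end, store_id)]
    week = 7 * 24 * 60
    return [(start, week - 1, store_id), (0, end, store_id)]

# 4:24pm Tuesday -> 2*1440 + 16*60 + 24 = 3864, so the lookup becomes:
#   SELECT storeId FROM store_hours WHERE startTime <= 3864 AND endTime >= 3864;
print(minutes_since_week_start(2, 16, 24))   # 3864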
If information doesn't change very often, and you want lookups to be very fast, you could solve every possible query up front and cache the results. For instance, there are only 672 quarter-hour periods in a week. With a list of businesses, each of which has a list of opening & closing times like in Brandon Rhodes's solution, you could simply iterate through every 15-minute period in a week, figure out who's open, then store the answer in a lookup table or in-memory list.
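A sketch of that precomputation; the schedule format here, a list of (open_slot, close_slot) quarter-hour pairs per business, is made up for illustration:

SLOTS_PER_WEEK = 7 * 24 * 4   # 672 quarter-hour periods

def build_lookup(schedules):
    """schedules: {business_id: [(open_slot, close_slot), ...]}"""
    lookup = [set() for _ in range(SLOTS_PER_WEEK)]
    for business_id, periods in schedules.items():
        for open_slot, close_slot in periods:
            for slot in range(open_slot, close_slot):
                lookup[slot].add(business_id)
    return lookup

# Joe's Coffee open Sunday 6am-9pm (slots 24-84, with Sunday as day 0)
lookup = build_lookup({'joes_coffee': [(24, 84)]})
print('joes_coffee' in lookup[30])   # Sunday 7:30am -> True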
The bitmap field mentioned by another respondent would be incredibly efficient, but gets messy if you want to be able to handle half-hour or quarter-hour times, since you have to increase the number of bits and rework the design of the field each time you encounter a new resolution that you have to match.
I would instead try storing the values as datetimes inside a list:
openclosings = [ open1, close1, open2, close2, ... ]
Then, I would use Python's "bisect_right()" function in its built-in "bisect" module to find, in fast O(log n) time, where in that list your query time "fits". Then, look at the index that is returned. If it is an even number (0, 2, 4...) then the time lies between one of the "closed" times and the next "open" time, so the shop is closed then. If, instead, the bisection index is an odd number (1, 3, 5...) then the time has landed between an opening and a closing time, and the shop is open.
Not as fast as bitmaps, but you don't have to worry about resolution, and I can't think of another O(log n) solution that's as elegant.
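Here is a minimal sketch of that bisection check; it uses minutes-since-midnight integers instead of datetimes just to keep the example short:

from bisect import bisect_right

# Open 9:00-12:00 and 13:00-17:30, in minutes since midnight
openclosings = [540, 720, 780, 1050]

def is_open(minutes):
    index = bisect_right(openclosings, minutes)
    return index % 2 == 1   # odd index -> we landed between an opening and a closing

print(is_open(16 * 60 + 24))   # 4:24pm -> True
print(is_open(12 * 60 + 30))   # 12:30pm -> False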
You say you're using SOLR, don't care about storage, and want the lookups to be fast. Then instead of storing open/close tuples, index an entry for every open block of time at the level of granularity you need (15 mins). For the encoding itself, you could use just cumulative hours:minutes.
For example, a store open from 4-5 pm on Monday, would have indexed values added for [40:00, 40:15, 40:30, 40:45]. A query at 4:24 pm on Monday would be normalized to 40:15, and therefore match that store document.
This may seem inefficient at first glance, but it's a relatively small constant penalty for indexing speed and space. And makes the searches as fast as possible.
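A sketch of that normalization, assuming Sunday is day 0 (which is what makes Monday 4pm come out as cumulative hour 40 in the example above):

def block_label(day_of_week, hour, minute):
    """Cumulative hours:minutes, snapped down to the 15-minute grid."""
    cumulative_hours = day_of_week * 24 + hour
    return '%d:%02d' % (cumulative_hours, minute - minute % 15)

# Values indexed for a store open 4-5pm on Monday:
print([block_label(1, 16, m) for m in (0, 15, 30, 45)])   # ['40:00', '40:15', '40:30', '40:45']
# Normalizing a query at 4:24pm on Monday:
print(block_label(1, 16, 24))   # '40:15'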
Sorry I don't have an easy answer, but I can tell you that as the manager of a development team at a company in the late 90's we were tasked with solving this very problem and it was HARD.
It's not the weekly hours that are tough; those can be handled with a relatively small bitmask (168 bits = 1 per hour of the week). The trick is the businesses which are closed every alternating Tuesday.
Starting with a bitmask then moving on to an exceptions field is the best solution I've ever seen.
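A sketch of the 168-bit weekly mask (the choice of hour 0 = midnight Sunday is arbitrary here); the exceptions field for things like alternating Tuesdays would sit alongside it:

def hours_to_mask(open_hours):
    """open_hours: iterable of hour-of-week indices (0-167) when the business is open."""
    mask = 0
    for hour in open_hours:
        mask |= 1 << hour
    return mask

def is_open(mask, day_of_week, hour):
    return bool((mask >> (day_of_week * 24 + hour)) & 1)

# Open Mon-Fri 9am-5pm (Sunday = day 0)
mask = hours_to_mask(day * 24 + h for day in range(1, 6) for h in range(9, 17))
print(is_open(mask, 2, 16))   # Tuesday 4pm -> True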
In your Solr index, instead of indexing each business as one document with hours, index every "retail session" for every business during the course of a week.
For example if Joe's coffee is open Mon-Sat 6am-9pm and closed on Sunday, you would index six distinct documents, each with two indexed fields, "open" and "close". If your units are 15 minute intervals, then the values can range from 0 to 7*24*4. Assuming you have a unique ID for each business, store this in each document so you can map the sessions to businesses.
Then you can simply do a range search in Solr:
open:[* TO N] AND close:[N+1 TO *]
where N is the index of the 15-minute interval that the current time falls into. For example, if it's 10:10AM on Wednesday, your query would be:
open:[* TO 112] AND close:[113 TO *]
aka "find a session that starts at or before 10:00am Wed and ends at or after 10:15am Wed"
If you want to include other criteria in your search, such as location or products, you will need to index this with each session document as well. This is a bit redundant, but if your index is not huge, it shouldn't be a problem.
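A sketch of computing N; the day origin (Sunday = 0 here) is an assumption and just has to match how you numbered the sessions at index time, so the exact constant may differ from the worked example above:

def interval_index(day_of_week, hour, minute):
    """Nth 15-minute interval of the week, from 0 to 7*24*4 - 1."""
    return day_of_week * 24 * 4 + hour * 4 + minute // 15

n = interval_index(3, 10, 10)   # Wednesday 10:10am, with Sunday = 0
query = 'open:[* TO %d] AND close:[%d TO *]' % (n, n + 1)
print(query)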
If you can control your data well, I see a simple solution, similar to #Sebastian's. Follow the advice of creating the tuples, except create them of the form [time=startTime, storeId] and [time=endTime, storeId], then sort these in a list. To find out if a store is open, simply do a query like:
select storeId
from table
where time <= '#1'
group by storeId
having count(storeId) % 2 = 1
To optimize this, you could build a lookup table that, for each time t, stores the stores that are open at t plus the store openings/closings between t and t+1 (for whatever granularity of t you choose).
However, this has the drawback of being harder to maintain (overlapping openings/closings need to be merged into a longer open-close period).
Have you looked at how many unique open/close time combinations there are? If there are not that many, make a reference table of the unique combinations and store the index of the appropriate entry against each business. Then you only have to search the reference table and then find the businesses with those indices.
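A sketch of that reference-table layering; the names are made up, and is_open_at stands in for whichever per-schedule lookup scheme from the other answers you pair it with:

schedules = {}          # schedule tuple -> index into the reference table
reference_table = []    # index -> schedule tuple
business_schedule = {}  # business id -> schedule index

def register(business_id, schedule):
    key = tuple(schedule)
    if key not in schedules:
        schedules[key] = len(reference_table)
        reference_table.append(key)
    business_schedule[business_id] = schedules[key]

def open_businesses(is_open_at):
    # Check each unique schedule once, then fan out to the businesses that use it.
    open_indices = {i for i, sched in enumerate(reference_table) if is_open_at(sched)}
    return [b for b, i in business_schedule.items() if i in open_indices]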