Paginating requests to an API - python

I'm consuming (via urllib/urllib2) an API that returns XML results. The API always returns the total_hit_count for my query, but only allows me to retrieve results in batches of, say, 100 or 1000. The API stipulates I need to specify a start_pos and end_pos for offsetting this, in order to walk through the results.
Say the urllib request looks like http://someservice?query='test'&start_pos=X&end_pos=Y.
If I send an initial 'taster' query with minimal data transfer, such as http://someservice?query='test'&start_pos=1&end_pos=1, in order to get back a result of, say, total_hits = 1234, I'd like to work out an approach to most cleanly request those 1234 results in batches of, again say, 100 or 1000.
This is what I came up with so far, and it seems to work, but I'd like to know if you would have done things differently or if I could improve upon this:
hits_per_page = 100  # or 1000 or 200 or whatever, adjustable
total_hits = 1234    # retrieved with BSoup from 'taster query'
base_url = "http://someservice?query='test'"
startdoc_positions = [n for n in range(1, total_hits, hits_per_page)]
enddoc_positions = [startdoc_position + hits_per_page - 1 for startdoc_position in startdoc_positions]

for start, end in zip(startdoc_positions, enddoc_positions):
    if end > total_hits:
        end = total_hits
    print "url to request is:\n ",
    print "%s&start_pos=%s&end_pos=%s" % (base_url, start, end)
p.s. I'm a long time consumer of StackOverflow, especially the Python questions, but this is my first question posted. You guys are just brilliant.

I'd suggest using
positions = ((n, n + hits_per_page - 1) for n in xrange(1, total_hits, hits_per_page))
for start, end in positions:
and then not worry about whether end exceeds total_hits unless the API you're using really cares whether you request something out of range; most will handle this case gracefully.
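For instance, a minimal sketch of the whole loop using the values from the question (just printing the URLs here; fetching them is whatever urllib2/httplib2 call you prefer):

hits_per_page = 100
total_hits = 1234  # from the 'taster' query
base_url = "http://someservice?query='test'"

positions = ((n, n + hits_per_page - 1) for n in xrange(1, total_hits, hits_per_page))
for start, end in positions:
    url = "%s&start_pos=%s&end_pos=%s" % (base_url, start, end)
    print url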
P.S. Check out httplib2 as a replacement for the urllib/urllib2 combo.

It might be interesting to use some kind of generator for this scenario to iterate over the list.
def getitems(base_url, per_page=100):
    content = ...urllib...
    total_hits = get_total_hits(content)
    sofar = 0
    while sofar < total_hits:
        items_from_next_query = ...urllib...
        for item in items_from_next_query:
            sofar += 1
            yield item
Mostly just pseudo code, but it could prove quite useful if you need to do this many times, by simplifying the logic it takes to get the items: callers just iterate over the results, which is quite natural in Python.
It saves you quite a bit of duplicate code as well.
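A slightly more concrete sketch of that generator, assuming Python 2's urllib2 and hypothetical get_total_hits/parse_items helpers for the XML side (adapt those to your actual response format):

import urllib2

def getitems(base_url, per_page=100):
    # 'taster' request to learn how many results exist
    content = urllib2.urlopen("%s&start_pos=1&end_pos=1" % base_url).read()
    total_hits = get_total_hits(content)  # hypothetical XML-parsing helper
    sofar = 0
    while sofar < total_hits:
        start = sofar + 1
        end = min(sofar + per_page, total_hits)
        content = urllib2.urlopen("%s&start_pos=%d&end_pos=%d" % (base_url, start, end)).read()
        for item in parse_items(content):  # hypothetical XML-parsing helper
            sofar += 1
            yield item

# usage:
# for item in getitems("http://someservice?query='test'"):
#     handle(item)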

Related

Mime type optimisation in python

I want to solve the MIME challenge on codingame.com. My code passes all the tests, but not the optimisation test.
I tried to remove all unnecessary work, like parsing to string, but I think the real problem is the way I approached it.
import sys
import math

# Auto-generated code below aims at helping you parse
# the standard input according to the problem statement.
n = int(input())  # Number of elements which make up the association table.
q = int(input())  # Number Q of file names to be analyzed.
dico = {}

# My function
def check(word):
    for item in dico:
        if(word[-len(item)-1:].upper() == "."+item.upper()):
            return(dico[item])
    return("UNKNOWN")

for i in range(n):
    # ext: file extension
    # mt: MIME type.
    ext, mt = input().split()
    dico[ext] = mt

for i in range(q):
    fname = input()
    print(check(fname))

# Write an action using print
# To debug: print("Debug messages...", file=sys.stderr)
Failure
Process has timed out. This may mean that your solution is not optimized enough to handle some cases.
This is the right idea, but one detail appears to be destroying the performance. The problem is the line for item in dico:, which unnecessarily loops over every entry in the dictionary. This is a linear search O(n), checking for the target item-by-item. But this pretty much defeats the purpose of the dictionary data structure, which is to offer constant-time O(1) lookups. "Constant time" means that no matter how big the dictionary gets, the time it takes to find an item is always the same (thanks to hashing).
To draw a metaphor, imagine you're looking for a spoon in your kitchen. If you know ahead of time where all the utensils, appliances and cookware are, you don't need to look in every drawer to find the utensils. Instead, you just go straight to the utensil drawer containing the spoon you want, and it's one-shot!
On the other hand, if you're in someone else's kitchen, it can be difficult to find a spoon. You have to start at one end of the cupboard and check every drawer until you find the utensils. In the worst-case, you might get unlucky and have to check every drawer before you find the utensil drawer.
Back to the code, the above snippet is using the latter approach, but we're dealing with trying to find something in 10k unfamiliar kitchens each with 10k drawers. Pretty slow, right?
If you can adjust the solution to check the dictionary in constant time, without a loop, then you can handle n = 10000 and q = 10000 without having to make q * n iterations (you can do it in q iterations instead--so much faster!).
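As a rough sketch of what the constant-time check could look like, assuming the table is built with lower-cased keys (dico[ext.lower()] = mt):

def check(word):
    if "." in word:
        # take everything after the last '.'; no loop over dico
        extension = word.rsplit(".", 1)[1].lower()
        return dico.get(extension, "UNKNOWN")
    return "UNKNOWN"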
Thank you for your help,
I figured out the solution.
n = int(input())  # Number of elements which make up the association table.
q = int(input())  # Number Q of file names to be analyzed.
dico = {}

# My function
def check(word):
    if("." in word):
        n = len(word)-(word.rfind(".")+1)
        extension = word[-n:].lower()
        if(extension in dico):
            return(dico[extension])
    return("UNKNOWN")

for i in range(n):
    # ext: file extension
    # mt: MIME type.
    ext, mt = input().split()
    dico[ext.lower()] = mt

for i in range(q):
    fname = input()
    print(check(fname))
Your explanation was clear :D
Thank you

Tracking how many elements processed in generator

I have a problem in which I process documents from files using python generators. The number of files I need to process is not known in advance. Each file contains records which consume a considerable amount of memory. Because of that, generators are used to process the records. Here is a summary of the code I am working on:
def process_all_records(files):
    for f in files:
        fd = open(f, 'r')
        recs = read_records(fd)
        recs_p = (process_records(r) for r in recs)
        write_records(recs_p)
My process_records function checks the content of each record and only returns the records which have a specific sender. My problem is the following: I want to have a count of the number of elements being returned by read_records. I have been keeping track of the number of records in the process_records function using a list:
def process_records(r):
    if r.sender('sender_of_interest'):
        records_list.append(1)
    else:
        records_list.append(0)
    ...
The problem with this approach is that records_list could grow without bounds depending on the input. I want to be able to consume the content of records_list once it grows to a certain point and then restart the process. For example, after 20 records have been processed, I want to find out how many records are from 'sender_of_interest' and how many are from other sources, and then empty the list. Can I do this without using a lock?
You could make your generator a class with an attribute that contains a count of the number of records it has processed. Something like this:
class RecordProcessor(object):
    def __init__(self, recs):
        self.recs = recs
        self.processed_rec_count = 0
    def __iter__(self):  # make the instance itself iterable, so write_records can consume it
        for r in self.recs:
            if r.sender('sender_of_interest'):
                self.processed_rec_count += 1
            # process record r...
            yield r  # processed record

def process_all_records(files):
    for f in files:
        fd = open(f, 'r')
        recs_p = RecordProcessor(read_records(fd))
        write_records(recs_p)
        print 'records processed:', recs_p.processed_rec_count
Here's the straightforward approach. Is there some reason why something this simple won't work for you?
seen = 0
matched = 0

def process_records(r):
    global seen, matched
    seen = seen + 1
    if r.sender('sender_of_interest'):
        matched = matched + 1
        records_list.append(1)
    else:
        records_list.append(0)
    if seen > 1000 or someOtherTimeBasedCriteria:
        print "%d of %d total records had the sender of interest" % (matched, seen)
        seen = 0
        matched = 0
If you have the ability to close your stream of messages and re-open them, you might want one more total seen variable, so that if you had to close that stream and re-open it later, you could go to the last record you processed and pick up there.
In this code, "someOtherTimeBasedCriteria" might be a timestamp check: record the current time when you begin processing, and if the current time is more than 20,000 ms (20 sec) later, reset the seen/matched counters.
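A minimal sketch of that time-based variant (module-level counters as above; the 20-second window and 1000-record threshold are just example values):

import time

seen = 0
matched = 0
window_start = time.time()

def process_records(r):
    global seen, matched, window_start
    seen += 1
    if r.sender('sender_of_interest'):
        matched += 1
    if seen > 1000 or time.time() - window_start > 20:
        print "%d of %d total records had the sender of interest" % (matched, seen)
        seen = 0
        matched = 0
        window_start = time.time()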

Best way to limit incoming messages to avoid duplicates

I have a system that accepts messages containing urls; if certain keywords are in the messages, an api call is made with the url as a parameter.
In order to conserve processing and keep my end presentation efficient, I don't want duplicate urls being submitted within a certain time range.
so if this url ---> http://instagram.com/p/gHVMxltq_8/ comes in and it's submitted to the api
url = incoming.msg['urls']
url = urlparse(url)
if url.netloc == "instagram.com":
    r = requests.get("http://api.some.url/show?url=%s" % url)
and then 3 secs later the same url comes in, I don't want it submitted to the api.
What programming method might I deploy to eliminate/limit duplicate messages from being submitted to the api based on time?
UPDATE USING TIM PETERS' METHOD:
limit = DecayingSet(86400)
l = limit.add(longUrl)
if l == False:
    pass
else:
    r = requests.get("http://api.some.url/show?url=%s" % url)
This snippet is inside a long-running process that accepts streaming messages via tcp.
Every time I pass the same url in, l returns True.
But when I try it in the interpreter everything is good, it returns False when the set time hasn't expired.
Does it have to do with the fact that the script is running, while the set is being added to?
Instance issues?
Maybe overkill, but I like creating a new class for this kind of thing. You never know when requirements will get fancier ;-) For example,
from time import time

class DecayingSet:
    def __init__(self, timeout):  # timeout in seconds
        from collections import deque
        self.timeout = timeout
        self.d = deque()
        self.present = set()

    def add(self, thing):
        # Return True if `thing` not already in set,
        # else return False.
        result = thing not in self.present
        if result:
            self.present.add(thing)
            self.d.append((time(), thing))
        self.clean()
        return result

    def clean(self):
        # forget stuff added >= `timeout` seconds ago
        now = time()
        d = self.d
        while d and now - d[0][0] >= self.timeout:
            _, thing = d.popleft()
            self.present.remove(thing)
As written, it checks for expirations whenever an attempt is made to add a new thing. Maybe that's not what you want, but it should be a cheap check since the deque holds items in order of addition, so gets out at once if no items are expiring. Lots of possibilities.
Why a deque? Because deque.popleft() is a lot faster than list.pop(0) when the number of items becomes non-trivial.
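In the poster's setting, usage might look roughly like this; note that the DecayingSet has to be created once, outside the message loop (if a fresh instance were created per incoming message, add() would return True every time, which matches the symptom described above):

import requests

limit = DecayingSet(86400)  # one shared instance for the whole long-running process

def handle_message(longUrl):
    # add() returns True only the first time a url is seen within the timeout window
    if limit.add(longUrl):
        r = requests.get("http://api.some.url/show?url=%s" % longUrl)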
Suppose your desired interval is 1 hour. Keep 2 counters that increment every hour but are offset 30 minutes from each other, i.e. counter A goes 1, 2, 3, 4 at 11:17, 12:17, 13:17, 14:17 and counter B goes 1, 2, 3, 4 at 11:47, 12:47, 13:47, 14:47.
Now if a link comes in and either of its two counters matches those of an earlier occurrence of the same link, consider it a duplicate.
The benefit of this scheme over explicit timestamps is that you can hash url+counterA and url+counterB to quickly check whether the url exists.
Update: You need two data stores: one, a regular database table (slow) with columns (url, counterA, counterB), and two, a chunk of n bits of memory (fast). Given a url so.com, counterA 17 and counterB 18, first hash "17,so.com" into the range 0 to n - 1 and see if the bit at that address is turned on. Similarly, hash "18,so.com" and see if the bit is turned on.
If the bit is not turned on in either case, you can be sure it is a fresh URL within the last hour, so we are done (quickly).
If the bit is turned on in either case, then look up the url in the database table to check whether it really was that url or some other URL that hashed to the same bit.
Further update: Bloom filters are an extension of this scheme.
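A stripped-down sketch of the two-counter idea, using a plain in-memory set in place of the bit array and database table purely to illustrate the offset counters (old entries would be evicted in a real implementation):

import time

seen = set()  # holds (stream, counter, url) tuples

def is_duplicate(url, now=None):
    now = time.time() if now is None else now
    counter_a = int(now // 3600)           # ticks once per hour
    counter_b = int((now + 1800) // 3600)  # ticks once per hour, offset by 30 minutes
    dup = ("A", counter_a, url) in seen or ("B", counter_b, url) in seen
    seen.add(("A", counter_a, url))
    seen.add(("B", counter_b, url))
    return dup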
I'd recommend keeping an in-memory cache of the most-recently-used URLs. Something like a dictionary:
urls = {}
and then for each URL:
if url in urls and (time.time() - urls[url]) < SOME_TIMEOUT:
    pass  # Don't submit the data
else:
    urls[url] = time.time()
    # Submit the data
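Note that the urls dict in this sketch grows for as long as the process runs; if that matters, one option (again just a sketch) is to sweep out stale entries now and then:

import time

def prune(urls, timeout):
    # drop entries older than `timeout` seconds
    cutoff = time.time() - timeout
    for u in list(urls):  # copy the keys so we can delete while iterating
        if urls[u] < cutoff:
            del urls[u]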

Better way to control flow around database query

In my program I need to read a very large table (it exceeds available memory), and I have written the following construct to read from the table and do some work. While I know it's very possible to re-write the select into an iterator style, it still has the same basic structure:
found = True
start = 0
batch = 2500
while found:
    found = False
    for obj in Channel.select().limit(batch).offset(start):
        found = True
        # do some big work...
    start += batch
What I would like to do is have something that doesn't carry around as many clunky state variables. Any ideas on how to clean up this bit of a mess?
FYI - I've tried this as well, not sure I like it any better:
@staticmethod
def qiter(q, start=0, batch=25000):
    obj = True
    while obj:
        for obj in q.limit(batch).offset(start):
            yield obj
        start += batch
The shortest thing I found is the following:
import itertools

for start in itertools.count(0, 2500):
    objs = Channel.select().limit(2500).offset(start)
    if not objs:
        break
    for obj in objs:
        pass  # do some big work...
Basically it's a combination of two things:
the count iterator (from the itertools package of the standard library) reduces the batch counting to a minimum, and
using a separate test and the break statement to get out of it.
In detail:
The count iterator is pretty simple: it yields the infinite series 0, 2500, 5000, 7500, .... As this loop would never end on its own, we need to break out of it somewhere. That is where the if statement comes into play: if objs is an empty result, the break exits the outer loop.
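If you want to hide even that bookkeeping from the calling code, the same idea can be wrapped in a small generator, along the lines of the qiter attempt in the question (a sketch; the query object and batch size are whatever you pass in):

import itertools

def qiter(q, batch=2500):
    # yield every row of `q`, fetching `batch` rows per query
    for start in itertools.count(0, batch):
        objs = list(q.limit(batch).offset(start))
        if not objs:
            return
        for obj in objs:
            yield obj

# usage:
# for obj in qiter(Channel.select()):
#     ... do some big work ...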
if you are just iterating and don't want to use up all your RAM you might check out the "iterator()" method on QueryResultWrapper.
For instance, say you need to iterate over 1,000,000 rows of data and serialize it:
# let's assume we've got 1M stat objects to dump to csv
stats_qr = Stat.select().execute()
# our imaginary serializer class
serializer = CSVSerializer()
# loop over all the stats and serialize
for stat in stats_qr.iterator():
    serializer.serialize_object(stat)
Below is a link to the documentation:
http://peewee.readthedocs.org/en/latest/peewee/cookbook.html#iterating-over-lots-of-rows

How can I make my code more efficient?

I have a list of tuples that contains a tool_id, a time, and a message. I want to select from this list all the elements where the message matches some string, and all the other elements where the time is within some diff of any matching message for that tool.
Here is how I am currently doing this:
# record time for each message matching the specified message for each tool
messageTimes = {}
for row in cdata:  # tool, time, message
    if self.message in row[2]:
        messageTimes[row[0], row[1]] = 1

# now pull out each message that is within the time diff for each matched message
# as well as the matched messages themselves
def determine(tup):
    if self.message in tup[2]: return True  # matched message
    for (tool, date_time) in messageTimes:
        if tool == tup[0]:
            if abs(date_time-tup[1]) <= tdiff:
                return True
    return False

cdata[:] = [tup for tup in cdata if determine(tup)]
This code works, but it takes way too long to run - e.g. when cdata has 600,000 elements (which is typical for my app) it takes 2 hours for this to run.
This data came from a database. Originally I was getting just the data I wanted using SQL, but that was taking too long also. I was selecting just the messages I wanted, then for each one of those doing another query to get the data within the time diff of each. That was resulting in tens of thousands of queries. So I changed it to pull all the potential matches at once and then process it in python, thinking that would be faster. Maybe I was wrong.
Can anyone give me some suggestions on speeding this up?
Updating my post to show what I did in SQL as was suggested.
What I did in SQL was pretty straightforward. The first query was something like:
SELECT tool, date_time, message
FROM event_log
WHERE message LIKE '%foo%'
AND other selection criteria
That was fast enough, but it may return 20 or 30 thousand rows. So then I looped through the result set, and for each row ran a query like this (where dt and t are the date_time and tool from a row from the above select):
SELECT date_time, message
FROM event_log
WHERE tool = t
AND ABS(TIMESTAMPDIFF(SECOND, date_time, dt)) <= timediff
That was taking about an hour.
I also tried doing in one nested query where the inner query selected the rows from my first query, and the outer query selected the time diff rows. That took even longer.
So now I am selecting without the message LIKE '%foo%' clause and I am getting back 600,000 rows and trying to pull out the rows I want from python.
The way to optimize the SQL is to do it all in one query, instead of iterating over 20K rows and doing another query for each one.
Usually this means you need to add a JOIN, or occasionally a sub-query. And yes, you can JOIN a table to itself, as long as you rename one or both copies. So, something like this:
SELECT el2.date_time, el2.message
FROM event_log AS el1
JOIN event_log AS el2
  ON el2.tool = el1.tool
WHERE el1.message LIKE '%foo%'
  AND other selection criteria
  AND ABS(TIMESTAMPDIFF(SECOND, el2.date_time, el1.date_time)) <= timediff
Now, this probably won't be fast enough out of the box, so there are two steps to improve it.
First, look for any columns that obviously need to be indexed. Clearly tool and date_time need simple indices. message may benefit from either a simple index or, if your database supports it, something fancier like a full-text index, but given that the initial query was fast enough, you probably don't need to worry about it.
Occasionally, that's sufficient. But usually, you can't guess everything correctly. And there may also be a need to rearrange the order of the queries, etc. So you're going to want to EXPLAIN the query, and look through the steps the DB engine is taking, and see where it's doing a slow iterative lookup when it could be doing a fast index lookup, or where it's iterating over a large collection before a small collection.
For tabular data, you can't go past the Python pandas library, which contains highly optimised code for queries like this.
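As a rough illustration of what that could look like (a sketch, assuming cdata rows are (tool, time, message) with time as a datetime, tdiff in seconds, and "foo" standing in for self.message; pandas' merge_asof finds, for each row, the nearest matching-message time for the same tool within the tolerance):

import pandas as pd

df = pd.DataFrame(cdata, columns=["tool", "time", "message"]).sort_values("time")
hits = df[df["message"].str.contains("foo")]  # rows containing the message of interest

nearest = pd.merge_asof(
    df,
    hits[["tool", "time"]].rename(columns={"time": "match_time"}),
    left_on="time", right_on="match_time", by="tool",
    tolerance=pd.Timedelta(seconds=tdiff), direction="nearest",
)
# keep rows within tdiff of some matching message (the matches themselves qualify trivially)
result = nearest[nearest["match_time"].notna()]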
I fixed this by changing my code as follows:
-first I made messageTimes a dict of lists keyed by the tool:
from collections import defaultdict

messageTimes = defaultdict(list)  # a dict with sorted lists
for row in cdata:  # tool, time, module, message
    if self.message in row[3]:
        messageTimes[row[0]].append(row[1])
-then in the determine function I used bisect:
import bisect

def determine(tup):
    if self.message in tup[3]: return True  # matched message
    times = messageTimes[tup[0]]
    le = bisect.bisect_right(times, tup[1])
    ge = bisect.bisect_left(times, tup[1])
    return (le and tup[1]-times[le-1] <= tdiff) or (ge != len(times) and times[ge]-tup[1] <= tdiff)
With these changes the code that was taking over 2 hours took under 20 minutes, and even better, a query that was taking 40 minutes took 8 seconds!
I made 2 more changes and now that 20 minute query is taking 3 minutes:
found = defaultdict(int)

def determine(tup):
    if self.message in tup[3]: return True  # matched message
    times = messageTimes[tup[0]]
    idx = found[tup[0]]
    le = bisect.bisect_right(times, tup[1], idx)
    found[tup[0]] = le  # remember where this search stopped so the next one for this tool starts there
    return (le and tup[1]-times[le-1] <= tdiff) or (le != len(times) and times[le]-tup[1] <= tdiff)
