I am trying to use the Confluence API "get_all_pages_from_space" to retrieve all pages (400 or so in total) in a Confluence space.
# Get all pages from Space
# content_type can be 'page' or 'blogpost'. Defaults to 'page'
# expand is a comma separated list of properties to expand on the content.
# max limit is 100. For more you have to loop over start values.
confluence.get_all_pages_from_space(space, start=0, limit=100, status=None, expand=None, content_type='page')
The documentation for this API (here) says that
max limit is 100. For more you have to loop over start values.
I don't know what it means to loop over the start values in my Python code. I used this API to retrieve all the pages under a space, but it only returns the first 50 or so pages.
Has anyone used this API? Please let me know how I can loop over the start values. Thank you!
A little bit late, but hopefully still might help:
def get_all_pages(confluence, space):
    start = 0
    limit = 100
    _all_pages = []
    while True:
        pages = confluence.get_all_pages_from_space(space, start, limit,
                                                    status=None, expand=None,
                                                    content_type='page')
        _all_pages = _all_pages + pages
        if len(pages) < limit:
            break
        start = start + limit
    return _all_pages
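For example, calling it with a client from atlassian-python-api (the URL, credentials, and "SPACEKEY" below are placeholders, not real values):

from atlassian import Confluence

confluence = Confluence(url="https://example.atlassian.net/wiki",
                        username="me@example.com",
                        password="api-token")      # placeholders

all_pages = get_all_pages(confluence, "SPACEKEY")
print(len(all_pages))   # should report all ~400 pages, not just the first 50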
I am trying to get a list of all my saved songs using current_user_saved_tracks(), but the limit is 20 tracks. Is there any way to access all 1000+ songs I have on my account?
The signature is as follows:
def current_user_saved_tracks(self, limit=20, offset=0)
The official Spotify API reference (beta) says that the maximum is limit=50. So, in a loop, call current_user_saved_tracks, but increment the offset by limit each time:
def get_all_saved_tracks(user, limit_step=50):
    tracks = []
    for offset in range(0, 10000000, limit_step):
        response = user.current_user_saved_tracks(
            limit=limit_step,
            offset=offset,
        )
        # The endpoint returns a paging object; the saved tracks live under 'items'.
        items = response['items']
        if len(items) == 0:
            break
        tracks.extend(items)
    return tracks
Loop until you get an empty response or an exception. I'm not sure which one.
If you don't have to worry about the user deciding to add a saved track while you are retrieving them, this should work.
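For context, a minimal way to drive this with spotipy (assuming spotipy 2.x with auth_manager support; SpotifyOAuth reads the usual SPOTIPY_CLIENT_ID / SPOTIPY_CLIENT_SECRET / SPOTIPY_REDIRECT_URI environment variables, and the saved-tracks endpoint needs the user-library-read scope):

import spotipy
from spotipy.oauth2 import SpotifyOAuth

sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope="user-library-read"))
saved = get_all_saved_tracks(sp)
print(len(saved))   # all saved tracks, fetched 50 at a time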
Yes, the default argument is limit=20. You could set a higher limit with the following code:
current_user_saved_tracks(limit=50)
Or you could set an offset to get the next 20 tracks:
current_user_saved_tracks(offset=20)
Source: https://spotipy.readthedocs.io/en/2.14.0/?highlight=current_user_saved#spotipy.client.Spotify.current_user_saved_tracks
I'm wondering if anyone can give me some advice on using Selenium with Python for web scraping.
I need to get the number of elements with a certain class on a page, and I have it working well with
driver=webdriver.PhantomJS()
driver.get('https://www.somerandomsite.com/1')
number_of_elements = len(driver.find_elements_by_class_name('some_class'))
this gets the right number of elements every time.
But now I want to define a function so it can scrape multiple webpages - say https://www.somerandomsite.com/1 to https://www.somerandomsite.com/10
So I do
driver = webdriver.PhantomJS()

def my_func(start, end):
    while start <= end:
        driver.get('https://www.somerandomsite.com/' + str(start))
        number_of_elements = len(driver.find_elements_by_class_name('some_class'))
        start += 1
Theoretically, this should move on to the next page and count the elements of that class on it. It works fine for the first page, but subsequent pages yield a count that's either equal to the number of elements on the previous page plus that of the current page, or that sum minus 1. If I use an XPath instead of a class name selector, I get the exact same results.
Also, if I try to access any elements that are in that longer list, it throws an error since only the values on that page actually exist. So I have no idea how it's getting that longer list if the elements on it don't even exist. (For example, if there are 8 elements on page one and 5 elements on page two, when it gets to page two it'll say there are 12 or 13 elements. If I access elements 1-5 they all return values, but trying to call the sixth element or higher will cause a NoSuchElementException.)
Anyone know why this might be happening?
EDIT: I've narrowed it down a bit more, hopefully this helps. Sorry I was off in the initial question.
driver = webdriver.PhantomJS()

def my_func(start, end):
    while start <= end:
        driver.get('https://www.somerandomsite.com/' + str(start))
        number_of_elements = len(driver.find_elements_by_class_name('some_class'))
        start += 1
So the above code actually works. However, when I then navigate to another page that also has elements of 'some_class', and then continue looping, it adds the number of elements from the previous page to the current page.
So my code's like this:
driver = webdriver.PhantomJS()

def my_func(start, end):
    while start <= end:
        driver.get('https://www.somerandomsite.com/' + str(start))
        number_of_elements = len(driver.find_elements_by_class_name('some_class'))
        print(number_of_elements)
        driver.get('https://www.somerandomsite.com/otherpage')
        start += 1

my_func(1, 2)
So let's say https://www.somerandomsite.com/1 has 8 elements of class 'some_class', https://www.somerandomsite.com/otherpage has 7 elements of class 'some_class', and https://www.somerandomsite.com/2 has 10 elements of class 'some_class'.
When I run the above code, it'll print 8, then 17. If I don't navigate to the other page, and run
driver = webdriver.PhantomJS()

def my_func(start, end):
    while start <= end:
        driver.get('https://www.somerandomsite.com/' + str(start))
        number_of_elements = len(driver.find_elements_by_class_name('some_class'))
        print(number_of_elements)
        start += 1

my_func(1, 2)
it'll print 8 then 10, as I want it to. I'm not sure why it's counting elements on two pages at once, and only if I get that other page beforehand.
EDIT2: So I've gotten it working by navigating to a page on a different server and then returning to the page I want. Weird, but I'll use it. If anyone has any ideas on why it doesn't work otherwise, though, I'd still love to understand the problem better.
It's difficult to tell what the problem is, if there even is one, since you don't provide the details needed to replicate what you're describing.
IMHO a function is overkill for this simple task; you could just drop it and write the loop directly. Also, as posted you never actually call the function, and it has no return statement, so it can't do anything at all. For this kind of thing I'd generally keep the loop outside the function, like so:
def my_func(driver, count):
    driver.get('https://www.somerandomsite.com/%d' % count)
    number_of_elements = len(driver.find_elements_by_class_name('some_class'))
    return number_of_elements

driver = webdriver.PhantomJS()
total_element_count = 0
count = 1

while count < 1000:  # or whatever number you need
    number_of_elements = my_func(driver, count)
    total_element_count += number_of_elements
    print("[*] Elements for iteration %d: %d" % (count, number_of_elements))
    print("[*] Total count so far: %d" % total_element_count)
    count += 1
Take a look at
number_of_elements = len(driver.find_elements_by_class_name('some_class'))
You assign the len of the elements found on each iteration, but instead you need to sum them, so your code should look like this:
driver = webdriver.PhantomJS()

def my_func(start, end):
    count = 0
    while start <= end:
        driver.get('https://www.somerandomsite.com/' + str(start))
        count += len(driver.find_elements_by_class_name('some_class'))
        start += 1
    return count   # total across all pages in the range
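With the question's example numbers (8 matching elements on page 1, 10 on page 2), a quick sanity check of the fixed function might look like this:

total = my_func(1, 2)
print(total)   # 18, i.e. 8 elements from page 1 plus 10 from page 2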
I'm trying to run a search using the New York Times' Article Search API v2. The way the API works, for queries with more than 10 results, the results are separated into pages, and you must request each page separately. So for instance, to get results 1-10, include "page=0" in the request. For results 11-20, it's "page=1", and so on.
I want to grab all the results. (On the query I'm testing, there are 147.) The following code works only on EVEN-NUMBERED pages. So it returns results just fine when the count is 0, 2, 4. On odd-numbered requests, it seems to successfully query the API -- it returns a "Status": "OK" value -- but the list of "docs", where results are generally stored, is empty.
I can't for the life of me figure out why. Any ideas?
from nytimesarticle import articleAPI
import time

# initialize API
api = articleAPI(API_key)

# conduct search
def custom_search(page_number):
    articles = api.search(q=query, page=page_number)
    return articles

count = 0
while True:
    # get first set of articles
    articles = custom_search(count)

    # print all the headlines
    for article in articles['response']['docs']:
        print "%s\nDate: %s\t%s\n" % (article['headline']['main'],
                                      article['pub_date'], count)

    # iterate the page number
    count += 1

    # pause to avoid rate limits
    time.sleep(2)
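For reference, here is a bounded version of that loop as a sketch. It assumes the 147-hit figure mentioned above and the API's 10-results-per-page behaviour, and it simply collects headlines instead of printing them; it doesn't explain the empty odd-numbered pages, it just shows what "requesting each page separately" looks like:

from nytimesarticle import articleAPI
import time

api = articleAPI(API_key)            # API_key and query as in the snippet above
total_hits = 147                     # known hit count for this test query
pages = (total_hits + 9) // 10       # 10 results per page -> 15 pages, numbered 0..14

headlines = []
for page in range(pages):
    articles = api.search(q=query, page=page)
    for article in articles['response']['docs']:
        headlines.append(article['headline']['main'])
    time.sleep(2)                    # pause to avoid rate limits, as above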
I have a system that accepts messages containing URLs; if certain keywords are in a message, an API call is made with the URL as a parameter.
In order to conserve processing and keep my end presentation efficient, I don't want duplicate URLs being submitted within a certain time range.
So if this URL ---> http://instagram.com/p/gHVMxltq_8/ comes in, it's submitted to the API:
url = incoming.msg['urls']
url = urlparse(url)
if url.netloc == "instagram.com":
    r = requests.get("http://api.some.url/show?url=%s" % url)
If the same URL comes in again 3 seconds later, I don't want it submitted to the API.
What programming method might I deploy to eliminate/limit duplicate messages from being submitted to the API based on time?
UPDATE USING TIM PETERS' METHOD:
limit = DecayingSet(86400)

l = limit.add(longUrl)
if l == False:
    pass
else:
    r = requests.get("http://api.some.url/show?url=%s" % url)
This snippet is inside a long-running process that accepts streaming messages via TCP.
Every time I pass the same URL in, l comes back True.
But when I try it in the interpreter everything is good: it returns False when the set's timeout hasn't expired.
Does it have to do with the fact that the script is running while the set is being added to?
Instance issues?
Maybe overkill, but I like creating a new class for this kind of thing. You never know when requirements will get fancier ;-) For example,
from time import time

class DecayingSet:
    def __init__(self, timeout):  # timeout in seconds
        from collections import deque
        self.timeout = timeout
        self.d = deque()
        self.present = set()

    def add(self, thing):
        # Return True if `thing` not already in set,
        # else return False.
        result = thing not in self.present
        if result:
            self.present.add(thing)
            self.d.append((time(), thing))
        self.clean()
        return result

    def clean(self):
        # forget stuff added >= `timeout` seconds ago
        now = time()
        d = self.d
        while d and now - d[0][0] >= self.timeout:
            _, thing = d.popleft()
            self.present.remove(thing)
As written, it checks for expirations whenever an attempt is made to add a new thing. Maybe that's not what you want, but it should be a cheap check since the deque holds items in order of addition, so it gets out at once if no items are expiring. Lots of possibilities.
Why a deque? Because deque.popleft() is a lot faster than list.pop(0) when the number of items becomes non-trivial.
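A quick demonstration of the class, using a 3-second timeout just to make the expiry visible (clean() is called explicitly here so the effect doesn't depend on another add happening first):

from time import sleep

limit = DecayingSet(3)                      # 3-second window, demo only
url = "http://instagram.com/p/gHVMxltq_8/"

print(limit.add(url))    # True  -> first sighting, OK to submit to the API
print(limit.add(url))    # False -> duplicate inside the window, skip it
sleep(4)                 # wait past the timeout
limit.clean()            # drop expired entries
print(limit.add(url))    # True  -> the old entry has decayed away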
Suppose your desired interval is 1 hour. Keep 2 counters that increment every hour but are offset 30 minutes from each other, i.e. counter A goes 1, 2, 3, 4 at 11:17, 12:17, 13:17, 14:17 and counter B goes 1, 2, 3, 4 at 11:47, 12:47, 13:47, 14:47.
Now if a link comes in and either of its two counter values matches those of an earlier occurrence of the same link, consider it a duplicate.
The benefit of this scheme over explicit timestamps is that you can hash url+counterA and url+counterB to quickly check whether the URL has been seen.
Update: You need two data stores: one, a regular database table (slow) with columns (url, counterA, counterB), and two, a chunk of n bits of memory (fast). Given a URL so.com, counterA 17 and counterB 18, first hash "17,so.com" into the range 0 to n - 1 and see if the bit at that address is turned on. Similarly, hash "18,so.com" and see if the bit is turned on.
If the bit is not turned on in either case you are sure it is a fresh URL within an hour, so we are done (quickly).
If the bit is turned on in either case then look up the url in the database table to check if it was that url indeed or some other URL that hashed to the same bit.
Further update: Bloom filters are an extension of this scheme.
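A minimal sketch of the two-counter idea in Python, using a plain set of keys in place of the bit array and database table described above (so no false-positive handling), with purely illustrative names:

import time

class TwoCounterDedup:
    def __init__(self, interval=3600):           # desired window, e.g. 1 hour
        self.interval = interval
        self.seen = set()                         # stands in for the n-bit array + table

    def is_duplicate(self, url):
        now = time.time()
        counter_a = int(now // self.interval)                         # counter A
        counter_b = int((now + self.interval // 2) // self.interval)  # counter B, offset by half
        key_a = "A%d,%s" % (counter_a, url)
        key_b = "B%d,%s" % (counter_b, url)
        if key_a in self.seen or key_b in self.seen:
            return True                           # same url seen with the same counter value
        self.seen.add(key_a)
        self.seen.add(key_b)
        return False                              # fresh within the interval

In a real version the old keys would need pruning; the bit-array scheme above sidesteps that because the rolling counters simply stop matching.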
I'd recommend keeping an in-memory cache of the most-recently-used URLs. Something like a dictionary:
urls = {}
and then for each URL:
if url in urls and (time.time() - urls[url]) < SOME_TIMEOUT:
    # Don't submit the data
    pass
else:
    urls[url] = time.time()
    # Submit the data
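Packaged as a small helper (the function name is illustrative; SOME_TIMEOUT is whatever window you need, in seconds):

import time

SOME_TIMEOUT = 3600      # seconds
urls = {}                # url -> time of last submission

def should_submit(url):
    last = urls.get(url)
    if last is not None and time.time() - last < SOME_TIMEOUT:
        return False     # seen too recently, don't submit the data
    urls[url] = time.time()
    return True          # go ahead and submit the data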
I'm consuming (via urllib/urllib2) an API that returns XML results. The API always returns the total_hit_count for my query, but only allows me to retrieve results in batches of, say, 100 or 1000. The API stipulates I need to specify a start_pos and end_pos for offsetting this, in order to walk through the results.
Say the urllib request looks like http://someservice?query='test'&start_pos=X&end_pos=Y.
If I send an initial 'taster' query with lowest data transfer such as http://someservice?query='test'&start_pos=1&end_pos=1 in order to get back a result of, for conjecture, total_hits = 1234, I'd like to work out an approach to most cleanly request those 1234 results in batches of, again say, 100 or 1000 or...
This is what I came up with so far, and it seems to work, but I'd like to know if you would have done things differently or if I could improve upon this:
hits_per_page = 100  # or 1000 or 200 or whatever, adjustable
total_hits = 1234    # retrieved with BSoup from the 'taster' query
base_url = "http://someservice?query='test'"

startdoc_positions = [n for n in range(1, total_hits, hits_per_page)]
enddoc_positions = [startdoc_position + hits_per_page - 1
                    for startdoc_position in startdoc_positions]

for start, end in zip(startdoc_positions, enddoc_positions):
    if end > total_hits:
        end = total_hits
    print "url to request is:\n ",
    print "%s&start_pos=%s&end_pos=%s" % (base_url, start, end)
P.S. I'm a long-time consumer of Stack Overflow, especially the Python questions, but this is my first posted question. You guys are just brilliant.
I'd suggest using
positions = ((n, n + hits_per_page - 1) for n in xrange(1, total_hits, hits_per_page))
for start, end in positions:
and then not worry about whether end exceeds total_hits, unless the API you're using really cares whether you request something out of range; most will handle this case gracefully.
P.S. Check out httplib2 as a replacement for the urllib/urllib2 combo.
It might be interesting to use some kind of generator for this scenario to iterate over the list.
def getitems(base_url, per_page=100):
    content = ...urllib...
    total_hits = get_total_hits(content)
    sofar = 0
    while sofar < total_hits:
        items_from_next_query = ...urllib...
        for item in items_from_next_query:
            sofar += 1
            yield item
This is mostly pseudocode, but it could prove quite useful if you need to do this many times, since it hides the paging logic: the caller can just iterate over the items as if they were a plain list, which is quite natural in Python. It also saves you quite a bit of duplicate code.
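To make that sketch a bit more concrete, here is one way it could look with Python 2's urllib2 doing the requests; get_total_hits() and parse_items() are placeholders for whatever XML parsing (e.g. BeautifulSoup) you already use:

import urllib2

def getitems(base_url, per_page=100):
    # Cheap 'taster' request first, just to learn the total hit count.
    taster = urllib2.urlopen("%s&start_pos=1&end_pos=1" % base_url).read()
    total_hits = get_total_hits(taster)          # placeholder: pull total_hit_count out of the XML

    sofar = 0
    while sofar < total_hits:
        start = sofar + 1
        end = min(sofar + per_page, total_hits)
        page = urllib2.urlopen("%s&start_pos=%d&end_pos=%d" % (base_url, start, end)).read()
        for item in parse_items(page):           # placeholder: extract the result records
            sofar += 1
            yield item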