Limits for Spotipy?

I am trying to get a list of all my saved songs using current_user_saved_tracks(), but the limit is 20 tracks. Is there any way to access all 1000+ songs I have on my account?

The signature is as follows:
def current_user_saved_tracks(self, limit=20, offset=0)
The official Spotify API reference (beta) says that the maximum is limit=50. So, in a loop, call current_user_saved_tracks, but increment the offset by limit each time:
def get_all_saved_tracks(user, limit_step=50):
    tracks = []
    for offset in range(0, 10000000, limit_step):
        response = user.current_user_saved_tracks(
            limit=limit_step,
            offset=offset,
        )
        # current_user_saved_tracks returns a paging object;
        # the tracks themselves are under the 'items' key
        items = response['items']
        if len(items) == 0:
            break
        tracks.extend(items)
    return tracks
Loop until you get an empty page of items or an exception. I'm not sure which one.
If you don't have to worry about the user deciding to add a saved track while you are retrieving them, this should work.
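For completeness, a minimal usage sketch (assuming credentials are provided via the usual SPOTIPY_* environment variables; the auth setup here is illustrative):
import spotipy
from spotipy.oauth2 import SpotifyOAuth

# Requires the user-library-read scope to read saved tracks.
sp = spotipy.Spotify(auth_manager=SpotifyOAuth(scope="user-library-read"))
all_tracks = get_all_saved_tracks(sp)
print(len(all_tracks))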

Yes, the default argument is limit=20. You could set a higher limit with the following code:
current_user_saved_tracks(limit=50)
Or you could set an offset to get the next 20 tracks:
current_user_saved_tracks(offset=20)
Source: https://spotipy.readthedocs.io/en/2.14.0/?highlight=current_user_saved#spotipy.client.Spotify.current_user_saved_tracks

Related

How to update an Azure Table Storage Entity with ETag

Overview: When I upload a blob to blob storage under container/productID(folder)/blobName, an event subscription saves this event in the storage queue. After that, an Azure Function polls this event and does the following:
1. Read the current count property (how many blobs are stored under productID(folder)) from the corresponding table, together with the ETag.
2. Increase the count by 1.
3. Write it back to the corresponding table. If the ETag matches, the count field is increased; otherwise an error is thrown. If an error is thrown, sleep a while and go back to step 1 (while loop).
4. If the property field is successfully increased, return.
Scenario: trying to upload five items to blob storage.
Expectation: the count property in table storage stores 5.
Problem: after the first four items are inserted successfully, the code gets stuck in an infinite loop for the fifth item, and the count property increases forever. Why could that happen? I don't have any ideas; any suggestions would be appreciated.
# more code
header_etag = "random-etag"
response_etag = "random-response"
while (response_etag != header_etag):
    sleep(random.random())  # sleep between 0 and 1 second
    header = table_service.get_entity_table(
        client_table, client_table, client_product)
    new_count = header['Count'] + 1
    entity_product = create_product_entity(
        client_table, client_product, new_count, client_image_table)
    header_etag = header['etag']
    try:
        response_etag = table_service1.merge_entity(client_table, entity_product,
                                                    if_match=header_etag)
    except:
        logging.info("race condition detected")
Try implementing your parameters and logic in a while loop, as in the code below:
val = 0  # initial value is zero
success = False
while True:
    val = val + 1  # incrementing
    CheckValidations = input('Check')  # add validations
    if CheckValidations == "abc123":
        success = True  # this will allow the loop to exit
        break
    if val > 5:
        break
    print("Please Try Again")
if success == True:
    print("Welcome!")
Also check the .py file below from the Azure Python SDK samples, which shows clear functions for updating, merging and inserting entities in table storage:
azure-sdk-for-python/sample_update_upsert_merge_entities.py at main · Azure/azure-sdk-for-python (github.com)
Refer to the Microsoft documentation to check which parameters to pass when creating an entity.
For more insight into table samples, check the Azure samples:
storage-table-python-getting-started/table_basic_samples.py at master · Azure-Samples/storage-table-python-getting-started (github.com)

How to use the Confluence API "get_all_pages_from_space"?

I am trying to use the Confluence API "get_all_pages_from_space" to retrieve all pages (400 or so in total) in a Confluence space.
# Get all pages from Space
# content_type can be 'page' or 'blogpost'. Defaults to 'page'
# expand is a comma separated list of properties to expand on the content.
# max limit is 100. For more you have to loop over start values.
confluence.get_all_pages_from_space(space, start=0, limit=100, status=None, expand=None, content_type='page')
The documentation for this API (here) says that
max limit is 100. For more you have to loop over start values.
I don't know what it means to loop over the start values in my Python code. I used this API to retrieve all the pages under a space, but it only returns the first 50 or so pages.
Is there anyone who has used this API? Please let me know how I can loop over the start values. Thank you!
A little bit late, but hopefully still might help:
def get_all_pages(confluence, space):
    start = 0
    limit = 100
    _all_pages = []
    while True:
        pages = confluence.get_all_pages_from_space(
            space, start, limit, status=None, expand=None, content_type='page')
        _all_pages = _all_pages + pages
        if len(pages) < limit:
            break
        start = start + limit
    return _all_pages
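Usage would then be something like the following (assuming the atlassian-python-api client; the URL, credentials and space key are placeholders):
from atlassian import Confluence

confluence = Confluence(url='https://example.atlassian.net',
                        username='user@example.com',
                        password='api-token')
pages = get_all_pages(confluence, 'SPACEKEY')
print(len(pages))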

Do variables need to be instantiated before a while loop in Python?

I'm trying to scrape more than 500 posts with the Reddit API, without PRAW. However, since I'm only allowed 100 posts at a time, I'm saving the scraped objects in an array called subreddit_content and will keep scraping until there are 500 posts in subreddit_content.
The code below gives me NameError: name 'subreddit_content_more' is not defined. If I instantiate subreddit_data_more = None before the while loop, I get TypeError: 'NoneType' object is not subscriptable. I've tried the same thing with a for loop but get the same results.
EDIT: updated code; the while loop now uses subreddit_data instead of subreddit_data_more, but now I'm getting TypeError: 'Response' object is not subscriptable despite converting subreddit_data to json.
subreddit_data = requests.get(f'https://api.reddit.com/r/{subreddit}/hot?limit=100', headers={'User-Agent': 'windows:requests (by /u/xxx)'})
subreddit_content = subreddit_data.json()['data']['children']
lastline_json = subreddit_content[-1]['data']['name']
while (len(subreddit_content) < 500):
    subreddit_data = requests.get(f'https://api.reddit.com/r/{subreddit}/hot?limit=100&after={lastline_json}', headers={'User-Agent': 'windows:requests (by /u/xxx)'})
    subreddit_content = subreddit_content.append(subreddit_data.json()['data']['children'])
    lastline_json = subreddit_data[-1]['data']['name']
    time.sleep(2.5)
EDIT2: Using .extend instead of .append and removing the variable assignment in the loop seemed to do the trick. This is the snippet of working code (I also renamed my variables for readability, courtesy of Wups):
data = requests.get(f'https://api.reddit.com/r/{subreddit}/hot?limit=100', headers={'User-Agent': 'windows:requests (by /u/xxx)'})
content_list = data.json()['data']['children']
lastline_name = content_list[-1]['data']['name']
while (len(content_list) < 500):
    data = requests.get(f'https://api.reddit.com/r/{subreddit}/hot?limit=100&after={lastline_name}', headers={'User-Agent': 'windows:requests (by /u/xxx)'})
    content_list.extend(data.json()['data']['children'])
    lastline_name = content_list[-1]['data']['name']
    time.sleep(2)
You want to just add one list to another list, but you're doing it wrong. One way to do that is:
the_next_hundred_records = subreddit_data.json()['data']['children']
subreddit_content.extend(the_next_hundred_records)
compare append and extend at https://docs.python.org/3/tutorial/datastructures.html
What you did with append was add the full list of the next 100 as a single sub-list at position 101. Then, because list.append returns None, you set subreddit_content = None
Let's try some smaller numbers so you can see what's going on in the debugger. Here is your code, super simplified, except instead of doing requests to get a list from subreddit, I just made a small list. Same thing, really. And I used multiples of ten instead of 100.
def do_query(start):
    return list(range(start, start + 10))

# content is initialized to a list by the first query
content = do_query(0)
while len(content) < 50:
    next_number = len(content)
    # there are a few valid ways to add to a list. Here's one.
    content.extend(do_query(next_number))

for x in content:
    print(x)
It would be better to use a generator, but maybe that's a later topic. Also, you might have problems if the subreddit actually has less than 500 records.
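For reference, here is a minimal sketch of that generator variant (the helper name and the 500-post cap are illustrative, and the same User-Agent header as in the question is assumed):
import time
import requests

def iter_posts(subreddit, max_posts=500):
    # Yield posts one at a time, following the 'after' cursor until
    # max_posts have been produced or Reddit stops returning results.
    headers = {'User-Agent': 'windows:requests (by /u/xxx)'}
    after = None
    yielded = 0
    while yielded < max_posts:
        url = f'https://api.reddit.com/r/{subreddit}/hot?limit=100'
        if after:
            url += f'&after={after}'
        children = requests.get(url, headers=headers).json()['data']['children']
        if not children:
            break
        for child in children:
            yield child
            yielded += 1
            if yielded >= max_posts:
                break
        after = children[-1]['data']['name']
        time.sleep(2)

content_list = list(iter_posts(subreddit))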

tweepy: get all mentions with api.search using max_id and since_id

I followed this link here to get all tweets that mention a certain query.
Now, the code works fine so far; I just want to make sure I actually understand it, since I don't want to use code when I don't even know how it does what it does.
This is my relevant code:
def searchMentions(tweetCount, maxTweets, searchQuery, tweetsPerQry, max_id, sinceId):
    while tweetCount < maxTweets:
        if (not max_id):
            if (not sinceId):
                new_tweets = api.search(q=searchQuery, count=tweetsPerQry)
            else:
                new_tweets = api.search(q=searchQuery, count=tweetsPerQry, since_id=sinceId)
        else:
            if (not sinceId):
                new_tweets = api.search(q=searchQuery, count=tweetsPerQry, max_id=str(max_id - 1))
            else:
                new_tweets = api.search(q=searchQuery, count=tweetsPerQry, max_id=str(max_id - 1), since_id=sinceId)
        if not new_tweets:
            print("No new tweets to show")
            break
        for tweet in new_tweets:
            try:
                tweetCount += len(new_tweets)
                max_id = new_tweets[-1].id
                tweetId = tweet.user.id
                username = tweet.user.screen_name
                api.update_status(tweet.text)
                print(tweet.text)
            except tweepy.TweepError as e:
                print(e.reason)
            except StopIteration:
                pass
max_id and sinceId are both set to None since no tweets have been found yet, I assume. tweetCount is set to zero.
The way I understand it, is that the while-loop runs while tweetCount < maxTweets. I'm not exactly sure why that is the case and why I can't use while True, for instance. At first I thought maybe it has to do with the rate of api calls but that doesn't really make sense.
Afterwards, the function checks for max_id and sinceId. I assume it checks if there is already a max_id and if max_id is none, it checks for sinceId. If sinceId is none then it simply gets however many tweets the count parameter is set to, otherwise it sets the lower bound to sinceId and gets however many tweets the count parameter is set to from sinceId on.
If max_id is not none but sinceId is set to none, it sets the upper limit to max_id and gets a certain number of tweets up to and including that bound. So if you had tweets with the ids 1, 2, 3, 4, 5 and with count=3 and max_id=5, you would get the tweets 3, 4, 5. Otherwise it sets the lower bound to sinceId and the upper bound to max_id and gets the tweets "in between".
Tweets that are found are saved in new_tweets.
Now, the function iterates through all tweets in new_tweets and sets tweetCount to the length of this list. Then max_id is set to new_tweets[-1].id. Since Twitter specifies that max_id is inclusive, I assume this is set to the tweet just before the last tweet so tweets aren't repeated; however, I'm not so sure about that, and I don't understand how my function would know what the id before the last tweet could be.
A tweet that repeats whatever the tweet in new_tweets said is posted.
So, to sum it up, my questions are:
Can I do while True instead of while tweetCount < maxTweets and if not, why?
Is the way I explained the function correct, if not, where did I go wrong?
What does max_id = new_tweets[-1].id do exactly?
Why do we not set sinceId to a new value in the for-loop? Since sinceId is set to None in the beginning, it seems unnecessary to go through the options of sinceId not being set to None if we do not change the value anywhere.
As a disclaimer: I did read through Twitter's explanation of max_id, since_id, counts, etc., but it did not answer my questions.
A few months ago, I used the same reference for the Search API. I came to understand a few things that might help you. I have assumed that the API returns tweets in an ordered fashion (descending order of tweet id).
Let's assume we have a bunch of tweets that Twitter is giving us for a query, with tweet ids from 1 to 10 (1 being the oldest and 10 the newest).
1 2 3 4 5 6 7 8 9 10
since_id = lower bound and
max_id = upper bound
Twitter starts to return the tweets in the order of newest to oldest (from 10 to 1). Let's take some examples:
# This would return tweets having ids between 4 and 10 (4 and 10 inclusive)
since_id=4, max_id=10
# This means there is no lower bound, and we will receive as many
# tweets as the Twitter Search API permits for the free version (i.e. for the last 7
# days). Hence, we will get tweets with ids 1 to 10 (1 and 10 inclusive)
since_id=None, max_id=10
What does max_id = new_tweets[-1].id do exactly?
Suppose in the first API call we received only 4 tweets, i.e. 10, 9, 8, 7. Hence, the new_tweets list becomes (I am treating it as a list of ids for the purpose of explanation; it is actually a list of tweet objects):
new_tweets = [10, 9, 8, 7]
max_id = new_tweets[-1]  # max_id = 7
Now when our program hits the API for the second time:
max_id = 7
since_id = None
new_tweets = api.search(q=searchQuery, count=tweetsPerQry, max_id=str(max_id - 1), since_id=sinceId)
# We will receive all tweets from 6 down to 1 now.
max_id = 6  # max_id = str(max_id - 1)
# Therefore
new_tweets = [6, 5, 4, 3, 2, 1]
This way of using the API (as mentioned in the reference) can return a maximum of 100 tweets per API call. The actual number of tweets returned is often less than 100 and also depends on how complex your query is; the less complex, the better.
Why do we not set sinceId to a new value in the for-loop? Since sinceId is set to None in the beginning, it seems unnecessary to go through the options of sinceId not being set to None if we do not change the value anywhere.
Setting sinceId=None returns the oldest of the tweets, but I am unsure what the default value of sinceId is if we don't mention it.
Can I do while True instead of while tweetCount < maxTweets and if not, why?
You can do this, but you then need to handle the exceptions that you'll get for reaching the rate limit (i.e. 100 tweets per call). Using this condition makes the handling of the program easier.
I hope this helps you.
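As a condensed sketch of the pagination pattern described above (using the older tweepy interface with api.search, as in the question; in tweepy 4.x the method is api.search_tweets, and the variables searchQuery, tweetsPerQry, sinceId and maxTweets are taken from the question's code):
collected = []
max_id = None
while len(collected) < maxTweets:
    kwargs = {'q': searchQuery, 'count': tweetsPerQry}
    if max_id:
        kwargs['max_id'] = str(max_id - 1)  # exclude the oldest tweet already seen
    if sinceId:
        kwargs['since_id'] = sinceId        # lower bound, if we have one
    new_tweets = api.search(**kwargs)
    if not new_tweets:
        break                               # nothing older left to fetch
    collected.extend(new_tweets)
    max_id = new_tweets[-1].id              # oldest tweet in this batch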
Can I do while True instead of while tweetCount < maxTweets and if not, why?
It's been a while since I used the Twitter API, but if I recall correctly, you have a limited number of calls and tweets per hour. This is to keep Twitter relatively clean. I recall maxTweets should be the amount you want to fetch. That's why you probably wouldn't want to use while True, but I believe you can replace it without any problems. You'll reach an exception eventually; that will be the API telling you that you reached your maximum amount.
What does max_id = new_tweets[-1].id do exactly?
Every tweet has an ID; it's the one you see in the URL when you open it. You use it to reference a specific tweet in your code. What that line does is update the max_id variable to the ID of the last tweet in the returned list. Remember that negative indexes refer to elements counted from the end of the list backwards.
I am not 100% sure about your other two questions, I'll edit later if I find anything.

Best way to limit incoming messages to avoid duplicates

I have a system that accepts messages that contain URLs; if certain keywords are in the messages, an API call is made with the URL as a parameter.
In order to conserve processing and keep my end presentation efficient, I don't want duplicate URLs being submitted within a certain time range.
So if this URL, http://instagram.com/p/gHVMxltq_8/, comes in and is submitted to the API:
url = incoming.msg['urls']
url = urlparse(url)
if url.netloc == "instagram.com":
    r = requests.get("http://api.some.url/show?url=%s" % url)
and then 3 seconds later the same URL comes in, I don't want it submitted to the API.
What programming method might I use to eliminate/limit duplicate messages from being submitted to the API based on time?
UPDATE, USING TIM PETERS' METHOD:
limit = DecayingSet(86400)
l = limit.add(longUrl)
if l == False:
    pass
else:
    r = requests.get("http://api.some.url/show?url=%s" % url)
This snippet is inside a long-running process that is accepting streaming messages via TCP.
Every time I pass the same URL in, l returns True.
But when I try it in the interpreter everything is good; it returns False when the set time hasn't expired.
Does it have to do with the fact that the script is running while the set is being added to?
Instance issues?
Maybe overkill, but I like creating a new class for this kind of thing. You never know when requirements will get fancier ;-) For example,
from time import time

class DecayingSet:
    def __init__(self, timeout):  # timeout in seconds
        from collections import deque
        self.timeout = timeout
        self.d = deque()
        self.present = set()

    def add(self, thing):
        # Return True if `thing` not already in set,
        # else return False.
        result = thing not in self.present
        if result:
            self.present.add(thing)
            self.d.append((time(), thing))
        self.clean()
        return result

    def clean(self):
        # forget stuff added >= `timeout` seconds ago
        now = time()
        d = self.d
        while d and now - d[0][0] >= self.timeout:
            _, thing = d.popleft()
            self.present.remove(thing)
As written, it checks for expirations whenever an attempt is made to add a new thing. Maybe that's not what you want, but it should be a cheap check since the deque holds items in order of addition, so gets out at once if no items are expiring. Lots of possibilities.
Why a deque? Because deque.popleft() is a lot faster than list.pop(0) when the number of items becomes non-trivial.
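A minimal usage sketch for the asker's case (the 3-second window is just an example; url and requests come from the question's code):
seen = DecayingSet(3)  # forget URLs 3 seconds after they were added
if seen.add(url):
    # first time this URL has been seen within the window, so submit it
    r = requests.get("http://api.some.url/show?url=%s" % url)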
Suppose your desired interval is 1 hour. Keep 2 counters that increment every hour but are offset 30 minutes from each other, i.e. counter A goes 1, 2, 3, 4 at 11:17, 12:17, 13:17, 14:17 and counter B goes 1, 2, 3, 4 at 11:47, 12:47, 13:47, 14:47.
Now if a link comes in and either of the two counters is the same as for an earlier occurrence of that link, consider it a duplicate.
The benefit of this scheme over explicit timestamps is that you can hash url+counterA and url+counterB to quickly check whether the URL exists.
Update: You need two data stores: one, a regular database table (slow) with columns (url, counterA, counterB), and two, a chunk of n bits of memory (fast). Given a URL so.com, counterA 17 and counterB 18, first hash "17,so.com" into the range 0 to n - 1 and see if the bit at that address is turned on. Similarly, hash "18,so.com" and see if the bit is turned on.
If the bit is not turned on in either case, you are sure it is a fresh URL within an hour, so we are done (quickly).
If the bit is turned on in either case, then look up the URL in the database table to check whether it was that URL indeed or some other URL that hashed to the same bit.
Further update: Bloom filters are an extension of this scheme.
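For concreteness, a rough sketch of the fast-path check described above (the bit-array size, the hash choice and the helper names are all illustrative; a real system would still confirm hits against the slow database table, and a Bloom filter library could replace the hand-rolled bit array):
import hashlib
import time

N_BITS = 1 << 20                  # size of the fast in-memory bit array
bits = bytearray(N_BITS // 8)

def counters(interval=3600):
    # two counters that increment every `interval` seconds, offset by half an interval
    now = time.time()
    return int(now // interval), int((now + interval // 2) // interval)

def probably_seen(url):
    # Return True if the URL may have been seen in the current window
    # (a hit still needs confirmation against the database); mark it seen either way.
    hit = False
    for counter in counters():
        digest = hashlib.sha1(f"{counter},{url}".encode()).digest()
        idx = int.from_bytes(digest[:4], 'big') % N_BITS
        byte, bit = divmod(idx, 8)
        if bits[byte] & (1 << bit):
            hit = True
        else:
            bits[byte] |= (1 << bit)
    return hit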
I'd recommend keeping an in-memory cache of the most-recently-used URLs. Something like a dictionary:
urls = {}
and then for each URL:
if url in urls and (time.time() - urls[url]) < SOME_TIMEOUT:
    # Don't submit the data
    pass
else:
    urls[url] = time.time()
    # Submit the data
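Wrapped into a small helper (the timeout value is just an example):
import time

SOME_TIMEOUT = 3.0  # seconds
urls = {}

def should_submit(url):
    # True if this URL has not been submitted within the timeout window.
    last = urls.get(url)
    if last is not None and (time.time() - last) < SOME_TIMEOUT:
        return False
    urls[url] = time.time()
    return True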
