delay between requests using github3 in python

I'm using the python github3 module and I need to set a delay between requests to the GitHub API, because my app puts too much load on the server.
I'm doing things such as:
git = github3.GitHub()
for i in itertools.chain(git.all_repositories(), git.repositories(type='private')):
    # do things
I found that github3.py uses requests to make requests to the GitHub API.
https://github.com/sigmavirus24/github3.py/blob/3e251f2a066df3c8da7ce0b56d24befcf5eb2d4b/github3/models.py#L233
But I can't figure out what parameter I should pass or what attribute I should change to set some delay between the requests.
Can you advise me something?

github3.py presently has no options to enforce delays between requests. That said, there is a way to get the request metadata which includes the number of requests you have left in your ratelimit as well as when that ratelimit should reset. I suggest you use git.rate_limit()['resources']['core'] to determine what delays you should set for yourself inside your own loop.
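For reference, here is a minimal sketch of what that lookup gives you; the limit, remaining, and reset keys come from GitHub's /rate_limit endpoint, reset is a Unix timestamp, and the values in the comment are just illustrative:
import github3

git = github3.GitHub()
core = git.rate_limit()['resources']['core']
# core is a plain dict, e.g. {'limit': 5000, 'remaining': 4723, 'reset': 1500000000}
print(core['remaining'], "requests remaining; window resets at", core['reset'])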

I use the following function when I expect to exceed my query limit (it assumes a logging logger has been configured elsewhere):
import time

def wait_for_karma(gh, min_karma=25, msg=None):
    while gh:
        core = gh.rate_limit()['resources']['core']
        if core['remaining'] < min_karma:
            # 'reset' is a Unix timestamp, so sleep until the window resets
            now = time.time()
            nap = max(core['reset'] - now, 0.1)
            logger.info("napping for %s seconds", nap)
            if msg:
                logger.info(msg)
            time.sleep(nap)
        else:
            break
I'll call it before making a call that I believe is "big" (i.e. could require multiple API calls to satisfy). Based on your code sample, you may want to do this at the bottom of your loop:
git = github3.GitHub()
for i in itertools.chain(git.all_repositories(), git.repositories(type='private')):
    do_things()
    wait_for_karma(git, msg="pausing")

Related

Python Flask: How to wait for webhook to be executed?

I am working on a Python Flask app, and the main method start() calls an external API (third_party_api_wrapper()). That external API has an associated webhook (webhook()) that receives the output of that external API call (note that the output webhook() receives is actually different from the response returned by third_party_api_wrapper()).
The main method start() needs the result of webhook(). How do I make start() wait for webhook() to be executed? And how do we pass the returned value of webhook() back to start()?
Here is a minimal code snippet to capture the scenario:
from flask import Flask
import requests

app = Flask(__name__)

@app.route('/webhook', methods=['POST'])
def webhook():
    return "webhook method has executed"

# this method has a webhook that calls webhook() after this method has executed
def third_party_api_wrapper():
    url = 'https://api.thirdparty.com'
    response = requests.post(url)
    return response

# this is the main entry point
@app.route('/start', methods=['POST'])
def start():
    third_party_api_wrapper()
    # The rest of this code depends on the output of webhook().
    # How do we wait until webhook() is called, and how do we access the returned value?
The answer to this question really depends on how you plan on running your app in production. It's much simpler if we make the assumption that you only plan to have a single instance of your app running at once (as opposed to multiple behind a load balancer, for example), so I'll make that assumption first to give you a place to start, and comment on a more "production-ready" solution afterwards.
A big thing to keep in mind when writing a web application is that you have to understand how you want the outside world to interact with your app. Do you expect to have the /start endpoint called only once at the beginning of your app's lifetime, or is this a generic endpoint that may start any number of background processes that you want the caller of each to wait for? Or do you want the behavior where any caller after the first one will wait for the same process to complete as the first one? I can't answer these questions for you; it depends on the use case you're trying to implement. I'll give you a relatively simple solution that you should be able to modify to fulfill any of the ones I mentioned, though.
This solution will use the Event class from the threading standard library module; I added some comments to clarify which parts you may have to change depending on the specifics of the API you're calling and stuff like that.
import threading
import uuid
from typing import Any

import requests
from flask import Flask, Response, request

# The base URL for your app, if you're running it locally this should be fine
# however external providers can't communicate with your `localhost` so you'll
# need to change this for your app to work end-to-end.
BASE_URL = "http://localhost:5000"

app = Flask(__name__)


class ThirdPartyProcessManager:
    def __init__(self) -> None:
        self.events = {}
        self.values = {}

    def wait_for_request(self, request_id: str) -> Any:
        event = threading.Event()
        actual_event = self.events.setdefault(request_id, event)
        if actual_event is not event:
            raise ValueError(f"Request {request_id} already exists.")
        event.wait()
        return self.values.pop(request_id)

    def finish_request(self, request_id: str, value: Any) -> None:
        event = self.events.pop(request_id, None)
        if event is None:
            raise ValueError(f"Request {request_id} does not exist.")
        self.values[request_id] = value
        event.set()


MANAGER = ThirdPartyProcessManager()


# This is assuming that you can specify the callback URL per-request, otherwise
# you may have to get the request ID from the body of the request or something
@app.route('/webhook/<request_id>', methods=['POST'])
def webhook(request_id: str) -> Response:
    MANAGER.finish_request(request_id, request.json)
    return "webhook method has executed"


# Somehow in here you need to create or generate a unique identifier for this
# request--this may come from the third-party provider, or you can generate one
# yourself. There are two main paths I see here:
# - If you can specify the callback/webhook URL in each call, you can just pass them
#   <base>/webhook/<request_id> and use that to identify which request is being
#   responded to in the webhook.
# - If the provider gives you a request ID, you can return it from this function
#   then retrieve it from the request body in the webhook route.
# For now, I'll assume the first situation, but you should be able to implement
# the second with minimal changes.
def third_party_api_wrapper() -> str:
    request_id = uuid.uuid4().hex
    url = 'https://api.thirdparty.com'
    # Just an example, I don't know how the third party API you're working with works
    response = requests.post(
        url,
        json={"callback_url": f"{BASE_URL}/webhook/{request_id}"}
    )
    # NOTE: unrelated to the problem at hand, you should always check for errors
    # in HTTP responses. This method is an easy way provided by requests to raise
    # for non-success status codes.
    response.raise_for_status()
    return request_id


@app.route('/start', methods=['POST'])
def start() -> Response:
    request_id = third_party_api_wrapper()
    result = MANAGER.wait_for_request(request_id)
    return result
If you want to run the example fully locally to test it, do the following:
1. Comment out the part of third_party_api_wrapper() that actually makes the external API call (the requests.post(...) call and response.raise_for_status()), leaving the request_id generation and return in place.
2. Add a print statement right after the request_id is generated, so that you can get the ID of the "in flight" request, e.g. print("Request ID", request_id).
3. In one terminal, run the app by pasting the above code into an app.py file and running flask run in that directory.
4. In another terminal, start the process via:
curl -XPOST http://localhost:5000/start
5. Copy the request ID that will be logged in the first terminal that's running the server.
6. In a third terminal, complete the process by calling the webhook:
curl -XPOST http://localhost:5000/webhook/<your_request_id> -H Content-Type:application/json -d '{"foo":"bar"}'
You should see {"foo":"bar"} as the response in the second terminal that made the /start request.
I hope that's enough to help you get started w/ whatever problem you're trying to solve.
There are a couple of design-y comments I have based on the information provided as well:
As I mentioned before, this will not work if you have more than one instance of the app running at once. This works by storing the state of in-flight requests in a global state inside your python process, so if you have more than one process, they won't all be working and modifying the same state. If you need to run more than one instance of your process, I would use a similar approach with some database backend to store the shared state (assuming your requests are pretty short-lived, Redis might be a good choice here, but once again it'll depend on exactly what you're trying to do).
Even if you do only have one instance of the app running, flask is capable of being run in a variety of different server contexts--for example, the server might be using threads (the default), greenlets via gevent or a similar library, or multiple processes, or maybe some other approach entirely in order to handle multiple requests concurrently. If you're using an approach that creates multiple processes, you should be able to use the utilities provided by the multiprocessing module to implement the same approach as I've given above.
This approach probably will work just fine for something where the difference in time between the API call and the webhook response is small (on the order of a couple of seconds at most I'd say), but you should be wary of using this approach for something where the difference in time can be quite large. If the connection between the client and your server fails, they'll have to make another request and run the long-running process that your third party is completing for you again. Some proxies and load balancers may also have time out behavior that could terminate the request after a certain amount of time even if nothing goes wrong in the connection between your server and the client making a request to it. An alternative approach would be for your /start endpoint to return quickly and give the client a request_id that they could poll for updates. As an example, AWS Athena's API is structured like this--there is a StartQueryExecution method, and separate GetQueryExecution and GetQueryResults methods that the client makes requests to check the status of a query and retrieve the results respectively (there are also other methods like StopQueryExecution and GetQueryRuntimeStatistics available as well). You can check out the documentation here.
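To make that alternative concrete, here is a rough sketch of a poll-based variant (the /status route and the RESULTS dict are hypothetical names, and this keeps the same single-process, in-memory caveats as the solution above):
import uuid

from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical in-memory store mapping request IDs to results;
# None means the request is still in flight.
RESULTS = {}

@app.route('/start', methods=['POST'])
def start():
    request_id = uuid.uuid4().hex
    RESULTS[request_id] = None
    # ... kick off the third-party call with a callback URL, as before ...
    return jsonify({"request_id": request_id})

@app.route('/webhook/<request_id>', methods=['POST'])
def webhook(request_id):
    RESULTS[request_id] = request.json  # store the provider's payload
    return "ok"

@app.route('/status/<request_id>', methods=['GET'])
def status(request_id):
    result = RESULTS.get(request_id)
    if result is None:
        return jsonify({"state": "pending"})
    return jsonify({"state": "done", "result": result})
The client then polls GET /status/<request_id> until the state is "done", instead of holding one long-lived connection open.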
I know that's a lot of info, but I hope it helps. Happy to update the answer w/ more specific info if you'll provide some more details about your use-case.

Get the duration of a URL-based Media object with python-vlc - Cannot parse

I'm trying to use the Python 2.7 python-vlc bindings to parse and then get the duration of a music track from a URL. Parsing doesn't work, and playing then pausing the media occasionally returns -1 for the duration.
There are two ways I know of to parse media, which has to be done before using media.get_duration(): I can parse it, or I can play it.
No matter what, I cannot parse the media. Using parse_with_options() gives me parsed status MediaParsedStatus.skipped for everything except parse_with_options(1, 0), which gives me parsed status MediaParsedStatus.FIXME_(0L).
import vlc

p = vlc.MediaPlayer(songurl)
media = p.get_media()
media.parse_with_options(1, 0)
print media.get_parsed_status()
print media.get_duration()
The string "songurl" is the actual streaming URL of a song from Youtube or Google Play Music, which works perfectly fine with the MediaPlayer.
I have also tried playing the media for short 0.01 to 0.5 second periods then attempting to get the time, which works MOST OF THE TIME but randomly returns a duration of -1 about 1 in 10 times. Using media.get_duration() again returns the same result.
I would prefer to just parse the song rather than worry about playing it, but I can't figure out any way to parse it.
I already submitted a bug report to the python-vlc github since I figure MediaParsedStatus.FIXME_(0L) is some sort of bug.
UPDATE: I GOT IT! This was possibly the biggest pain in all my programming career (which isn't much). Here's the code used to get the time for a URL track:
import vlc

instance = vlc.Instance()
media = instance.media_new(songurl)
player = instance.media_player_new()
player.set_media(media)
# Start the parser
media.parse_with_options(1, 0)
while True:
    if str(media.get_parsed_status()) == 'MediaParsedStatus.done':
        break  # Might be a good idea to add a failsafe in here.
print media.get_duration()
media.parse_with_options is asynchronous, so your code isn't waiting for a response from the URL; it just immediately moves on. As with all asynchronous methods, you need to receive a notification that the data has been received before you can move on. In this case it looks like the relevant event is MediaParsedChanged.
https://www.videolan.org/developers/vlc/doc/doxygen/html/group__libvlc__media.html#ga55f5a33e22aa32e17a9bb75decd1497b
Alternatively, you should be able to use the parse() method which is synchronous and will block until the meta data is received. This isn't recommended (and it's deprecated) because it could block indefinitely and lock up. But it is an option depending on what you are using the code for.
https://www.videolan.org/developers/vlc/doc/doxygen/html/group__libvlc__media.html#ga4b71084fb35b3dd8cc6457a4d27baf0c
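For instance, a minimal sketch of the synchronous route (assuming songurl as defined in the question):
import vlc

instance = vlc.Instance()
media = instance.media_new(songurl)
media.parse()  # deprecated and may block indefinitely, but synchronous
print media.get_duration()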
EDIT:
If you need an example of using the event manager with the python bindings, here is a great example:
VLC Python EventManager callback type?
Particularly, look at Rolf's answer as the way he is using it might be a good starting point for you.
import vlc

parseReady = 0

def ParseReceived(event):
    global parseReady
    # set a flag that your data is ready
    parseReady = 1

...

events = player.event_manager()
events.event_attach(vlc.EventType.MediaParsedChanged, ParseReceived)

...

parseReady = 0
media.parse_with_options(1, 0)
while parseReady == 0:
    # TODO: spin something to waste time
    pass
# Once the flag is set, your data is ready
print media.get_parsed_status()
print media.get_duration()
There are undoubtedly better ways to do it, but that's a basic example. Note, according to the documentation, you cannot call vlc methods from within an event callback, hence the use of a simple flag rather than calling the media methods directly in the callback.
libvlc will not parse network resources by default. You need to call parse_with_options() with the libvlc_media_parse_network flag.
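For example, a minimal sketch, assuming the python-vlc bindings expose the flag as vlc.MediaParseFlag.network (the value that maps to libvlc_media_parse_network):
import vlc

instance = vlc.Instance()
media = instance.media_new(songurl)  # a network URL, as in the question
# parse network resources; -1 uses libvlc's default timeout
media.parse_with_options(vlc.MediaParseFlag.network, -1)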

Have a python function run for an allotted time

I have a python script that pulls from various internal network sources. With how our systems are set up, we will initiate a urllib pull from a network location and it will get hung up waiting forever for a response on certain parts of the network. I would like my script to check whether the pull has finished within, let's say, 5 minutes; if not, it should skip that function, attempt to pull from the next address, and record the failure to a bad-address log (so we can go check out which systems get hung up; there are over 20,000 IP addresses we are checking, some running older scripts that no longer work but will still try to run when requested, and they never stop trying to run).
I'm familiar with having a script pause at a certain point:
import time
time.sleep(300)
What I'm thinking from a pseudocode perspective (not proper python, just illustrating the idea):
import time
import urllib2

url_dict = ['http://1', 'http://2', 'http://3', ...]
fail_log_path = 'C:/Temp/fail_log.txt'

for addresses in url_dict:
    clock_value = time.start()
    while clock_value <= 300:
        print str(clock_value)
        res = urllib2.retrieve(url)
    if res != []:
        pass
    else:
        fail_log = open(fail_log_path, 'a')
        fail_log.write("Failed to pull from site location: " + str(url) + "\n")
        fail_log.close()
Update: found a specific option for dealing with URL timeouts: timeout for urllib2.urlopen() in pre Python 2.6 versions
Found this answer which is more in line with the overall problem of my question:
kill a function after a certain time in windows
Your code as-is doesn't seem to do what you describe. It seems you want the if/else check inside your while loop. On top of that, you would want to loop over the IP addresses and not over a time period as your code is currently written (otherwise you will keep requesting the same IP address every time). Instead of keeping track of time yourself, I would suggest reading up on urllib.request.urlopen - specifically the timeout parameter. Once set, that function call will throw a socket.timeout exception when the time limit is reached. Surround it with a try/except block catching that error and then handle it appropriately.
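As a rough sketch of that suggestion (using Python 3's urllib.request; the URL list, 300-second timeout, and log path are placeholders taken from the question):
import socket
import urllib.error
import urllib.request

urls = ['http://1', 'http://2', 'http://3']
fail_log_path = 'C:/Temp/fail_log.txt'

for url in urls:
    try:
        # urlopen raises socket.timeout (or a URLError wrapping it) once
        # the timeout elapses without a response
        response = urllib.request.urlopen(url, timeout=300)
        data = response.read()
    except (socket.timeout, urllib.error.URLError) as err:
        with open(fail_log_path, 'a') as fail_log:
            fail_log.write("Failed to pull from site location: %s (%s)\n" % (url, err))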

How to get the location of all users without hitting the GitHub API usage limit

Currently I am trying to get the location of every GitHub user. I am using the github3 python library to get the location, but it gives me an over-API-usage error once my API calls exceed 5K. Here is my code.
import github3
from datetime import datetime
import sys

def main(pswd):
    g = github3.login(username="rakeshcusat", password=pswd)
    current_time = datetime.now()
    fhandler = open("githubuser_" + current_time.strftime("%d-%m-%y-%H:%M:%S"), "w")
    for user in g.iter_all_users():
        user.refresh()
        try:
            fhandler.write(" user: {0}, email: {1}, location: {2}\n".format(str(user), str(user.email), str(user.location)))
        except:
            print "Something wrong, user id : {0}".format(user.id)
    fhandler.close()

if __name__ == "__main__":
    if len(sys.argv) == 2:
        main(sys.argv[1])
    else:
        print "Please provide your password"
I can do this by downloading all usernames first, which will be only a single API call, and then iteratively downloading the user locations. If I hit over-usage I can wait for one hour and resume the API calls where they left off. But this seems like a lame solution and it will definitely take more time (almost 25+ hours). Can someone provide me a better way of doing this?
So if you use the development version of github3.py you can use the per_page parameter, e.g.,
for user in g.iter_all_users(per_page=200):
    user.refresh()
    #: other logic
The thing is, you'll save 7 requests using per_page (by default 1 request returns 25 users if I remember correctly, so with per_page=200 you get the equivalent of 8 requests in 1). The problem is you're then using up 200 requests rather quickly with User#refresh. What you could do, to avoid the ratelimit, is to use sleep in your code to space out your requests. 5000 requests split over 3600 seconds is 1.389 requests per second. If each request takes half a second (which I think is an underestimation personally), you could do
import time

for user in g.iter_all_users(per_page=200):
    user.refresh()
    #: other logic
    time.sleep(0.5)
This will make sure one request is made per second and that you never hit the ratelimit. Regardless, it's rather lame.
In the future, I would store these values in the database using the user's id as the id in the database and then just look for the max and try to start there. I'll have to check if /users supports something akin to the since parameter. Alternatively, you could also work like so
import time

i = g.iter_all_users(per_page=200)
for user in i:
    user.refresh()
    #: other logic
    time.sleep(0.5)

# We have all users
# store i.etag somewhere then later
i = g.iter_all_users(per_page=200, etag=i.etag)
for user in i:
    user.refresh()
    #: etc
The second iterator should give you all new users since the last one in your last request if I remember correctly but I'm currently very tired so I could be remembering something wrong.

Find out if the current machine is on aws in python

I have a python script that runs on aws machines, as well as on other machines.
The functionality of the script depends on whether or not it is on AWS.
Is there a way to programmatically discover whether or not it runs on AWS? (maybe using boto?)
If you want to do that strictly using boto, you could do:
import boto.utils
md = boto.utils.get_instance_metadata(timeout=.1, num_retries=0)
The timeout specifies how long the HTTP client will wait for a response before timing out. The num_retries parameter controls how many times the client will retry the request before giving up and returning an empty dictionary.
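A minimal follow-up sketch, assuming an empty dict means the metadata service could not be reached (i.e. you are not on EC2):
import boto.utils

md = boto.utils.get_instance_metadata(timeout=.1, num_retries=0)
is_on_aws = bool(md)  # empty dict -> not running on an EC2 instance
if is_on_aws:
    print(md.get('instance-id'))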
You can easily use the AWS SDK and check for the instance id.
Besides that, you can check the AWS IP ranges - check out this link:
https://forums.aws.amazon.com/ann.jspa?annID=1701
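If you go the IP-range route, here is a rough sketch; it assumes your public IP is already known and uses the JSON range list AWS publishes at https://ip-ranges.amazonaws.com/ip-ranges.json (a newer mechanism than the forum announcement above):
import ipaddress
import json
import urllib.request

def ip_in_aws_ranges(my_ip):
    # download the published list of AWS IPv4 prefixes
    with urllib.request.urlopen("https://ip-ranges.amazonaws.com/ip-ranges.json") as resp:
        ranges = json.load(resp)
    addr = ipaddress.ip_address(my_ip)
    return any(addr in ipaddress.ip_network(p["ip_prefix"]) for p in ranges["prefixes"])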
I found a way, using:
import requests

try:
    # a short timeout keeps this from hanging when the metadata service is unreachable
    instance_id_resp = requests.get('http://169.254.169.254/latest/meta-data/instance-id', timeout=0.5)
    is_on_aws = True
except (requests.exceptions.ConnectionError, requests.exceptions.Timeout):
    is_on_aws = False
I tried some of the above, and when not running on Amazon I had trouble accessing 169.254.169.254. Maybe it has something to do with the fact I'm outside the US.
In any case, here's a piece of code that worked for me:
def running_on_amazon():
    import urllib2
    import socket
    # I'm using curlmyip.com, but there are other websites that provide the same service
    ip_finder_addr = "http://curlmyip.com"
    f = urllib2.urlopen(ip_finder_addr)
    my_ip = f.read(100).strip()
    host_addr = socket.gethostbyaddr(my_ip)
    my_public_name = host_addr[0]
    amazon = (my_public_name.find("aws") >= 0)
    return amazon  # returns a boolean value.
