I want to make a script that automatically sends a lot of requests to a URL via Python.
Example link: https://page-views.glitch.me/badge?page_id=page.id
I've tried Selenium, but that's very slow.
pip install requests
import requests

for i in range(100):  # Or whatever amount of requests you wish to send
    requests.get("https://page-views.glitch.me/badge?page_id=page.id")
Or, if you really wanted to hammer the address, you could use multiprocessing:
import multiprocessing as mp
import requests

def my_func(x):
    # Each worker call sends x GET requests in a row
    for i in range(x):
        print(requests.get("https://page-views.glitch.me/badge?page_id=page.id"))

def main():
    # One worker process per CPU core; map hands a different x to each call
    pool = mp.Pool(mp.cpu_count())
    pool.map(my_func, range(0, 100))
    pool.close()
    pool.join()

if __name__ == "__main__":
    main()
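Since sending requests is I/O-bound rather than CPU-bound, a thread pool is often a lighter-weight alternative to a process pool. A sketch using the standard library's concurrent.futures (the worker count of 20 is an arbitrary choice):

from concurrent.futures import ThreadPoolExecutor

import requests

URL = "https://page-views.glitch.me/badge?page_id=page.id"

def fetch(_):
    # Each call sends one GET request and returns its status code
    return requests.get(URL).status_code

with ThreadPoolExecutor(max_workers=20) as pool:
    print(list(pool.map(fetch, range(100))))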
You can send multiple get() requests in a loop as follows:
for i in range(100):
    driver.get("https://page-views.glitch.me/badge?page_id=page.id")
I have been working with Prometheus and Python, and I want to be able to have multiple scripts that write to Prometheus.
Currently I have two scripts:
sydsvenskan.py
import time
import requests
from prometheus_client import Counter

REQUEST_COUNT = Counter(
    namespace="scraper",
    name="request_count",
    documentation="Count the total requests",
    labelnames=['http_status']
)

def monitor_feed():
    while True:
        with requests.get("https://sydsvenskan.se") as rep:
            print("Request made!")
            REQUEST_COUNT.labels(http_status=rep.status_code).inc()
        time.sleep(10)

if __name__ == '__main__':
    monitor_feed()
BBC.py
import time
import requests
from prometheus_client import Counter

REQUEST_COUNT = Counter(
    namespace="scraper",
    name="request_count",
    documentation="Count the total requests",
    labelnames=['http_status']
)

def monitor_feed():
    while True:
        with requests.get("https://bbc.com") as rep:
            print("Request made!")
            REQUEST_COUNT.labels(http_status=rep.status_code).inc()
        time.sleep(10)

if __name__ == '__main__':
    monitor_feed()
and then I have another script that just starts the Prometheus http_server:
from prometheus_client import start_http_server

if __name__ == '__main__':
    start_http_server(8000)
However, the problem is that nothing seems to reach Prometheus from sydsvenskan.py and bbc.py, and I wonder what I am doing wrong. I do not see any statistics growing when running sydsvenskan and bbc at the same time.
You need to combine the start_http_server function with your monitor_feed functions.
You can either combine everything under a single HTTP server, or, as I think you want, run two HTTP servers, one with each monitor_feed:
import time
import requests
from prometheus_client import Counter
from prometheus_client import start_http_server

REQUEST_COUNT = Counter(
    namespace="scraper",
    name="request_count",
    documentation="Count the total requests",
    labelnames=['http_status']
)

def monitor_feed():
    while True:
        with requests.get("https://bbc.com") as rep:
            print("Request made!")
            REQUEST_COUNT.labels(http_status=rep.status_code).inc()
        time.sleep(10)

if __name__ == '__main__':
    # start_http_server runs the metrics endpoint in a background thread,
    # so the scraping loop can run in the foreground of the same process
    start_http_server(8000)
    monitor_feed()
In the latter case, if you run both servers on the same host machine, you'll need to use 2 different ports (you can't use 8000 for both).
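For example, if BBC.py keeps port 8000, sydsvenskan.py could expose its metrics on its own port (8001 here is an arbitrary choice):

if __name__ == '__main__':
    start_http_server(8001)  # a different port, so both exporters can run on one host
    monitor_feed()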
I have some threads that are downloading content from various websites using Python's built-in urllib module. The code looks something like this:
from urllib.request import Request, urlopen
from threading import Thread

##do stuff

def download(url):
    req = urlopen(Request(url))
    return req.read()

url = "somerandomwebsite"

Thread(target=download, args=(url,)).start()
Thread(target=download, args=(url,)).start()

#Do more stuff
The user should have an option to stop loading data. While I can use flags/events to avoid using the data once the download finishes after the user has cancelled, I can't actually stop the download itself.
Is there a way to either stop the download (and preferably do something when the download is stopped) or forcibly (and safely) kill the thread the download is running in?
Thanks in advance.
You can use urllib.request.urlretrieve instead, which takes a reporthook argument:
from urllib.request import urlretrieve
from threading import Thread

url = "someurl"
flag = 0

def dl_progress(count, blksize, filesize):
    # reporthook: called by urlretrieve after every block is downloaded
    global flag
    if flag:
        raise Exception('download canceled')

Thread(target=urlretrieve, args=(url, "test.rar", dl_progress)).start()

if cancel_download():  # pseudo-code: set the flag when the user cancels
    flag = 1
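If you would rather stay with urlopen, the same idea works: read the response in chunks and check a cancellation flag between chunks. A minimal sketch (the Event and chunk size are illustrative, not part of the original code):

from urllib.request import Request, urlopen
from threading import Thread, Event

cancel = Event()  # set this from the UI to request cancellation

def download(url, chunk_size=64 * 1024):
    data = bytearray()
    with urlopen(Request(url)) as resp:
        while not cancel.is_set():
            chunk = resp.read(chunk_size)
            if not chunk:
                return bytes(data)  # finished normally
            data.extend(chunk)
    return None  # cancelled part-way through

Thread(target=download, args=("somerandomwebsite",)).start()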
I have a tiny, stupid script which makes a lot of requests to the Google search service:
from concurrent.futures import ThreadPoolExecutor

import requests
import requests.packages.urllib3
requests.packages.urllib3.disable_warnings()

def check(page):
    r = requests.get('https://www.google.ru/#q=test&start={}'.format(page * 10))
    return len(r.text)

import time

def main():
    for q in xrange(30):
        st_t = time.time()
        with ThreadPoolExecutor(20) as pool:
            ret = [x for x in pool.map(check, xrange(1, 1000))]
        print time.time() - st_t

if __name__ == "__main__":
    main()
It works at first, but then something goes wrong: all 20 threads are alive, yet they do nothing. I can see in htop that they are alive, but I don't understand why nothing happens.
Any ideas what could be wrong?
This is a known issue, and the requests team did not get enough information to debug it (see this). Possibly it is a CPython issue (see this).
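That said, one common mitigation is to give every request a timeout, so a stuck socket cannot wedge a worker thread forever. A sketch of the check function with a timeout added (the timeout value is arbitrary):

import requests

def check(page):
    # Abort the request if the server stops responding for 30 seconds
    r = requests.get(
        'https://www.google.ru/#q=test&start={}'.format(page * 10),
        timeout=30,
    )
    return len(r.text)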
I have a list of URLs like:
l=['bit.ly/1bdDlXc','bit.ly/1bdDlXc',.......,'bit.ly/1bdDlXc']
I just want to get the full URL from the short one for every element in that list.
Here is my approach:
import urllib2

for i in l:
    print urllib2.urlopen(i).url
But when the list contains thousands of URLs, the program takes a long time.
My question: is there any way to reduce the execution time, or any other approach I should follow?
First method
As suggested, one way to accomplish the task would be to use the official Bitly API, which does, however, have limitations (e.g., no more than 15 shortUrls per request).
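A rough sketch of what such a call might look like is below; the endpoint, parameter names, and response layout are written from memory of Bitly's v3 expand API and should be checked against the current documentation before use:

import requests

ACCESS_TOKEN = "your-bitly-token"  # hypothetical placeholder

def expand_batch(short_urls):
    # The v3 expand endpoint accepted at most 15 shortUrl parameters per call
    resp = requests.get(
        "https://api-ssl.bitly.com/v3/expand",
        params={"access_token": ACCESS_TOKEN, "shortUrl": short_urls[:15]},
    )
    return [item.get("long_url") for item in resp.json()["data"]["expand"]]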
Second method
As an alternative, one could just avoid fetching the contents, e.g. by using the HEAD HTTP method instead of GET. Here is some sample code, which makes use of the excellent requests package:
import requests

l = ['bit.ly/1bdDlXc', 'bit.ly/1bdDlXc', ......., 'bit.ly/1bdDlXc']

for i in l:
    print requests.head("http://" + i).headers['location']
import requests

def get_real_url_from_shortlink(url):
    # requests follows redirects by default, so resp.url is the final, expanded URL
    resp = requests.get(url)
    return resp.url
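Usage would simply be, for example:

print(get_real_url_from_shortlink("http://bit.ly/1bdDlXc"))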
I'd try Twisted's asynchronous web client. Be careful with this, though: it doesn't rate-limit at all.
#!/usr/bin/python2.7

from twisted.internet import reactor
from twisted.internet.defer import Deferred, DeferredList, DeferredLock
from twisted.internet.defer import inlineCallbacks
from twisted.web.client import Agent, HTTPConnectionPool
from twisted.web.http_headers import Headers
from pprint import pprint
from collections import defaultdict
from urlparse import urlparse
from random import randrange
import fileinput

pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = 16

agent = Agent(reactor, pool)
locks = defaultdict(DeferredLock)
locations = {}

def getLock(url, simultaneous=1):
    return locks[urlparse(url).netloc, randrange(simultaneous)]

@inlineCallbacks
def getMapping(url):
    # Limit ourselves to 4 simultaneous connections per host.
    # Tweak this as desired, but make sure that it is no larger than
    # pool.maxPersistentPerHost
    lock = getLock(url, 4)
    yield lock.acquire()
    try:
        resp = yield agent.request('HEAD', url)
        locations[url] = resp.headers.getRawHeaders('location', [None])[0]
    except Exception as e:
        locations[url] = str(e)
    finally:
        lock.release()

dl = DeferredList(getMapping(url.strip()) for url in fileinput.input())
dl.addCallback(lambda _: reactor.stop())

reactor.run()
pprint(locations)
Please check this Python code:
#!/usr/bin/env python
import requests
import multiprocessing
from time import sleep, time
from requests import async

def do_req():
    r = requests.get("http://w3c.org/")

def do_sth():
    while True:
        sleep(10)

if __name__ == '__main__':
    do_req()
    multiprocessing.Process(target=do_sth, args=()).start()
When I press Ctrl-C (waiting about 2 seconds after the run, so the Process is running), it doesn't stop. When I change the import order to:
from requests import async
from time import sleep, time
it stops after Ctrl-C. Why doesn't it stop/get killed in the first example?
Is it a bug or a feature?
Notes:
Yes, I know that I didn't use async in this code; this is just stripped-down code. In the real code I use it; I stripped it down to simplify my question.
After pressing Ctrl-C there is a new (child) process running. Why?
multiprocessing.__version__ == 0.70a1, requests.__version__ == 0.11.2, gevent.__version__ == 0.13.7
The requests async module uses gevent. If you look at the source code of gevent, you will see that it monkey-patches many of Python's standard library functions, including sleep.
The requests.async module executes the following during import:
from gevent import monkey as curious_george
# Monkey-patch.
curious_george.patch_all(thread=False, select=False)
Looking at the monkey.py module of gevent you can see:
https://bitbucket.org/denis/gevent/src/f838056c793d/gevent/monkey.py#cl-128
def patch_time():
    """Replace :func:`time.sleep` with :func:`gevent.sleep`."""
    from gevent.hub import sleep
    import time
    patch_item(time, 'sleep', sleep)
Take a look at the code in gevent's repository for details.
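A quick way to see the effect of the patching (a minimal sketch, assuming gevent is installed):

import time
original_sleep = time.sleep

from gevent import monkey
monkey.patch_all(thread=False, select=False)

print(time.sleep is original_sleep)  # False: time.sleep is now gevent's cooperative sleep

This is also likely why the import order matters in your example: from time import sleep binds whatever time.sleep refers to at that moment, so importing it before requests.async gives you the original blocking sleep, while importing it afterwards gives you gevent's patched sleep.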