I have a list of URLs like:
l=['bit.ly/1bdDlXc','bit.ly/1bdDlXc',.......,'bit.ly/1bdDlXc']
I just want to get the full URL from the short one for every element in that list.
Here is my approach:
import urllib2
for i in l:
    print urllib2.urlopen(i).url
But when the list contains thousands of URLs, the program takes a long time.
My question: is there any way to reduce the execution time, or is there any other approach I should follow?
First method
As suggested, one way to accomplish the task would be to use the official bitly API, which does, however, have limitations (e.g., no more than 15 shortUrls per request).
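For completeness, here is a rough sketch of that first method. The /v3/expand endpoint, the access token, and the response layout are my assumptions about bitly's (legacy) API, not something taken from this answer, so treat it as an outline only:
import requests

ACCESS_TOKEN = "your-bitly-access-token"  # hypothetical placeholder

def expand_batch(short_urls):
    # Assumed endpoint; it accepts repeated shortUrl parameters,
    # up to the 15-per-request limit mentioned above.
    resp = requests.get(
        "https://api-ssl.bitly.com/v3/expand",
        params=[("access_token", ACCESS_TOKEN)]
               + [("shortUrl", "http://" + u) for u in short_urls],
    )
    return [entry.get("long_url") for entry in resp.json()["data"]["expand"]]

expanded = []
for i in range(0, len(l), 15):  # chunk the list into groups of at most 15
    expanded.extend(expand_batch(l[i:i + 15]))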
Second method
As an alternative, one could simply avoid fetching the contents, e.g. by using the HEAD HTTP method instead of GET. Here is a short sample that makes use of the excellent requests package:
import requests
l=['bit.ly/1bdDlXc','bit.ly/1bdDlXc',.......,'bit.ly/1bdDlXc']
for i in l:
    print requests.head("http://"+i).headers['location']
from requests import get

def get_real_url_from_shortlink(url):
    resp = get(url)
    return resp.url
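Neither snippet above runs the lookups concurrently, and that is usually where the real time savings are for thousands of URLs. Here is a sketch (not from either answer) that reuses the list l from the question and pushes the same HEAD lookups through a thread pool; the worker count of 20 is an arbitrary choice:
import concurrent.futures
import requests

def expand(short_url):
    try:
        # head() does not follow redirects by default, so the Location
        # header still holds the expanded URL.
        return requests.head("http://" + short_url).headers.get("location")
    except requests.RequestException as e:
        return str(e)

with concurrent.futures.ThreadPoolExecutor(max_workers=20) as executor:
    for short_url, long_url in zip(l, executor.map(expand, l)):
        print("{} -> {}".format(short_url, long_url))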
I'd try Twisted's asynchronous web client. Be careful with this, though; it doesn't rate-limit at all.
#!/usr/bin/python2.7
from twisted.internet import reactor
from twisted.internet.defer import Deferred, DeferredList, DeferredLock
from twisted.internet.defer import inlineCallbacks
from twisted.web.client import Agent, HTTPConnectionPool
from twisted.web.http_headers import Headers
from pprint import pprint
from collections import defaultdict
from urlparse import urlparse
from random import randrange
import fileinput
pool = HTTPConnectionPool(reactor)
pool.maxPersistentPerHost = 16
agent = Agent(reactor, pool)
locks = defaultdict(DeferredLock)
locations = {}
def getLock(url, simultaneous=1):
    return locks[urlparse(url).netloc, randrange(simultaneous)]

@inlineCallbacks
def getMapping(url):
    # Limit ourselves to 4 simultaneous connections per host.
    # Tweak this as desired, but make sure that it is no larger than
    # pool.maxPersistentPerHost.
    lock = getLock(url, 4)
    yield lock.acquire()
    try:
        resp = yield agent.request('HEAD', url)
        locations[url] = resp.headers.getRawHeaders('location', [None])[0]
    except Exception as e:
        locations[url] = str(e)
    finally:
        lock.release()
dl = DeferredList(getMapping(url.strip()) for url in fileinput.input())
dl.addCallback(lambda _: reactor.stop())
reactor.run()
pprint(locations)
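Since the script reads its input with fileinput.input(), you would run it as, for example, python expand.py urls.txt, or pipe the short links in on stdin (the script name here is just an illustration); the url-to-location mapping is printed once the reactor has stopped.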
Related
I have some threads that are downloading content from various websites using Python's built-in urllib module. The code looks something like this:
from urllib.request import Request, urlopen
from threading import Thread
##do stuff
def download(url):
    req = urlopen(Request(url))
    return req.read()
url = "somerandomwebsite"
Thread(target=download, args=(url,)).start()
Thread(target=download, args=(url,)).start()
#Do more stuff
The user should have an option to stop loading data, and while I can use flags/events to prevent the data from being used once the download finishes if the user has stopped it, I can't actually stop the download itself.
Is there a way to either stop the download (and preferably do something when the download is stopped) or forcibly (and safely) kill the thread the download is running in?
Thanks in advance.
You can use urllib.request.urlretrieve instead, which takes a reporthook argument:
from urllib.request import urlretrieve
from threading import Thread
url = "someurl"
flag = 0
def dl_progress(count, blksize, filesize):
    global flag
    if flag:
        raise Exception('download canceled')

Thread(target=urlretrieve, args=(url, "test.rar", dl_progress)).start()

if cancel_download():
    flag = 1
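An alternative sketch, not from this answer: read the response in chunks and check a threading.Event between chunks, so the loop can stop shortly after the event is set. The names here (download, cancel_event, the chunk size) are illustrative:
from threading import Event, Thread
from urllib.request import Request, urlopen

cancel_event = Event()

def download(url, dest_path):
    resp = urlopen(Request(url))
    with open(dest_path, "wb") as out:
        while not cancel_event.is_set():
            chunk = resp.read(8192)
            if not chunk:   # finished normally
                return True
            out.write(chunk)
    return False            # cancelled part-way through

Thread(target=download, args=("someurl", "test.rar")).start()
# Later, e.g. from the main/UI thread:
# cancel_event.set()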
I'd like to make a script that automatically sends a lot of requests to a URL via Python.
Example link: https://page-views.glitch.me/badge?page_id=page.id
I've tried Selenium, but that's very slow.
pip install requests
import requests
for i in range(100): # Or whatever amount of requests you wish to send
requests.get("https://page-views.glitch.me/badge?page_id=page.id")
Or, if you really wanted to hammer the address, you could use multiprocessing:
import multiprocessing as mp
import requests
def my_func(x):
    for i in range(x):
        print(requests.get("https://page-views.glitch.me/badge?page_id=page.id"))

def main():
    pool = mp.Pool(mp.cpu_count())
    pool.map(my_func, range(0, 100))

if __name__ == "__main__":
    main()
You can send multiple get() requests in a loop as follows:
for i in range(100):
    driver.get("https://page-views.glitch.me/badge?page_id=page.id")
I've tried to create a scraper using Python in combination with Thread to make execution faster. The scraper is supposed to parse all the shop names along with their phone numbers, traversing multiple pages.
The script runs without any issues. As I'm very new to working with Thread, I can hardly tell whether I'm doing it the right way.
This is what I've tried so far with:
import requests
from lxml import html
import threading
from urllib.parse import urljoin
link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"
def get_information(url):
    for pagelink in [url.format(page) for page in range(20)]:
        response = requests.get(pagelink).text
        tree = html.fromstring(response)
        for title in tree.cssselect("div.info"):
            name = title.cssselect("a.business-name span[itemprop=name]")[0].text
            try:
                phone = title.cssselect("div[itemprop=telephone]")[0].text
            except Exception:
                phone = ""
            print(f'{name} {phone}')
thread = threading.Thread(target=get_information, args=(link,))
thread.start()
thread.join()
The problem is that I can't find any difference in time or performance whether I run the above script with Thread or without it. If I'm going wrong, how can I execute the above script using Thread?
EDIT: I've tried to change the logic to use multiple links. Is it possible now? Thanks in advance.
You can use Threading to scrape several pages in parallel as below:
import requests
from lxml import html
import threading
from urllib.parse import urljoin
link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"
def get_information(url):
    response = requests.get(url).text
    tree = html.fromstring(response)
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span[itemprop=name]")[0].text
        try:
            phone = title.cssselect("div[itemprop=telephone]")[0].text
        except Exception:
            phone = ""
        print(f'{name} {phone}')

threads = []
for url in [link.format(page) for page in range(20)]:
    thread = threading.Thread(target=get_information, args=(url,))
    threads.append(thread)
    thread.start()

for thread in threads:
    thread.join()
Note that the sequence of the data will not be preserved. It means that if you scrape the pages one by one, the sequence of the extracted data will be:
page_1_name_1
page_1_name_2
page_1_name_3
page_2_name_1
page_2_name_2
page_2_name_3
page_3_name_1
page_3_name_2
page_3_name_3
while with Threading the data will be mixed:
page_1_name_1
page_2_name_1
page_1_name_2
page_2_name_2
page_3_name_1
page_2_name_3
page_1_name_3
page_3_name_2
page_3_name_3
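If the ordering matters, one option (a sketch, not part of this answer) is to have each worker return its rows instead of printing them and collect the results with concurrent.futures, whose map() yields results in submission order even though the pages are fetched concurrently:
import concurrent.futures
import requests
from lxml import html

link = "https://www.yellowpages.com/search?search_terms=coffee&geo_location_terms=Los%20Angeles%2C%20CA&page={}"

def get_information(url):
    tree = html.fromstring(requests.get(url).text)
    rows = []
    for title in tree.cssselect("div.info"):
        name = title.cssselect("a.business-name span[itemprop=name]")[0].text
        try:
            phone = title.cssselect("div[itemprop=telephone]")[0].text
        except Exception:
            phone = ""
        rows.append(f'{name} {phone}')
    return rows

with concurrent.futures.ThreadPoolExecutor(max_workers=8) as executor:
    # map() returns results in the order the URLs were submitted.
    for rows in executor.map(get_information, [link.format(page) for page in range(20)]):
        for row in rows:
            print(row)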
I am trying out the following code to learn threading in Python.
import urllib.request
import re
import threading
from sys import argv, exit
if len(argv[1:]) == 0:
    exit("You haven't entered any arguments. Try again.")
else:
    comps = argv[1:]

def extr(comp):
    url = 'http://finance.yahoo.com/q?s=' + comp
    req = urllib.request.Request(url)
    resp = urllib.request.urlopen(req)
    respData = resp.read()
    print(re.findall(r'<span id="yfs_l84_[^.]*">(.*?)</span>', str(respData)))

for x in comps:
    t = threading.Thread(extr(x))
    t.daemon = True
    t.start()
I get the right result but one after the other and not at once. Am I missing something?
t = threading.Thread(extr(x)) is the problem. You are calling extr(x), and passing the result of that to the Thread constructor. Try Thread(target=extr, args=(x,)).
You'll then need to use something like https://docs.python.org/2/library/queue.html to allow threads to pass the result data back to the main thread before they terminate. You'd create the queue in the main thread, and pass it as an argument into each subthread.
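A rough sketch of that suggestion (the worker body and the example symbols are placeholders, not the question's actual parsing code): create the queue in the main thread, hand it to every worker, and drain it after joining the threads:
import queue
import threading

comps = ["GOOG", "AAPL"]   # example symbols standing in for argv[1:]

def extr(comp, results):
    # ... fetch and parse as in the question, then report back ...
    results.put((comp, "parsed data for " + comp))

results = queue.Queue()
threads = [threading.Thread(target=extr, args=(comp, results), daemon=True)
           for comp in comps]
for t in threads:
    t.start()
for t in threads:
    t.join()
while not results.empty():
    print(results.get())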
I'm new to Twisted, and after finally figuring out how deferreds work I'm struggling with tasks. What I want to achieve is a script that sends a REST request in a loop; however, if at some point it fails, I want to stop the loop. Since I'm using callbacks I can't easily catch exceptions, and because I don't know how to stop the looping from an errback, I'm stuck.
This is the simplified version of my code:
def send_request():
    agent = Agent(reactor)
    req_result = agent.request('GET', some_rest_link)
    req_result.addCallbacks(cp_process_request, cb_process_error)

if __name__ == "__main__":
    list_call = task.LoopingCall(send_request)
    list_call.start(2)
    reactor.run()
To end a task.LoopingCall, all you need to do is call stop() on the returned object (list_call in your case).
Somehow you need to make that variable available to your errback (cb_process_error), either by pushing it into a class that cb_process_error is a method of, via some other class used as a pseudo-global, or by literally using a global; then you simply call list_call.stop() inside the errback.
BTW you said:
Since I'm using callbacks I can't easily catch exceptions
That's not really true. The point of an errback is to deal with exceptions; that's one of the things that literally causes it to be called! Check out my previous deferred answer and see if it makes errbacks any clearer.
The following is a runnable example (... I'm not saying this is the best way to do it, just that it is a way...)
#!/usr/bin/python
from twisted.internet import task
from twisted.internet import reactor
from twisted.internet.defer import Deferred
from twisted.web.client import Agent
from pprint import pprint
class LoopingStuff(object):

    def cp_process_request(self, return_obj):
        print "In callback"
        pprint(return_obj)

    def cb_process_error(self, return_obj):
        print "In Errorback"
        pprint(return_obj)
        self.loopstopper()

    def send_request(self):
        agent = Agent(reactor)
        req_result = agent.request('GET', 'http://google.com')
        req_result.addCallbacks(self.cp_process_request, self.cb_process_error)

def main():
    looping_stuff_holder = LoopingStuff()
    list_call = task.LoopingCall(looping_stuff_holder.send_request)
    looping_stuff_holder.loopstopper = list_call.stop
    list_call.start(2)
    reactor.callLater(10, reactor.stop)
    reactor.run()

if __name__ == '__main__':
    main()
Assuming you can get to google.com, this will fetch pages for 10 seconds. If you change the second argument of agent.request to something like http://127.0.0.1:12999 (assuming that port 12999 will give a connection refused), then you'll see one errback printout (which will also have shut down the LoopingCall) and a 10-second wait until the reactor stops.