After a long time writing this small script, it finally works, or rather, it almost works. I have one remaining problem: I am not able to send cookies via addheaders in urllib.request. What am I doing wrong? I need to send a specific cookie or the website will not let me download the .pdf file, but I believe I am doing this the wrong way.
Here is my code; please let me know what is wrong:
import os
import threading
import urllib.request
from queue import Queue
class Downloader(threading.Thread):
    """Threaded File Downloader"""
    def __init__(self, queue):
        """Initialize the thread"""
        threading.Thread.__init__(self)
        self.queue = queue
    def run(self):
        """Run the thread"""
        while True:
            # gets the url from the queue
            url = self.queue.get()
            # download the file
            self.download_file(url)
            # send a signal to the queue that the job is done
            self.queue.task_done()
    def download_file(self, url):
        """Download the file"""
        handle = urllib.request.urlopen(url)
        faturanum = 20184009433300
        fatura = str(faturanum)
        fname = fatura + ".pdf"
        handle.addheaders = [('Cookie', 'ASP.NET_SessionId=zstuzktl0x1laoqhxgkm4ign')]
        with open(fname, "wb") as f:
            while True:
                chunk = handle.read(1024)
                if not chunk: break
                f.write(chunk)
def main(urls):
    """
    Run the program
    """
    queue = Queue()
    # create a thread pool and give them a queue
    for i in range(5):
        t = Downloader(queue)
        t.setDaemon(True)
        t.start()
    # give the queue some data
    for url in urls:
        queue.put(url)
    # wait for the queue to finish
    queue.join()
if __name__ == "__main__":
    urls = ["https://pagamentodigitaltsting.com/Fatura/Pdf?nrFatura=20193981821"]
    main(urls)
Why can't I send cookies with the request? As you can see, the website is served over https. Once the page loads, it renders a PDF file.
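One likely cause, as far as can be told from the code shown: the script assigns addheaders on the object returned by urlopen(), which is the response, so the request has already been sent without the cookie. A minimal sketch of attaching the header before sending, using urllib.request.Request, could look like the following. The helper names build_request and download_file, the chunk_size parameter, and the choice to pass the cookie as an argument are assumptions made for illustration; the URL and session id are copied from the question.

```python
import urllib.request

# Values copied from the question; replace them with a live session id.
URL = "https://pagamentodigitaltsting.com/Fatura/Pdf?nrFatura=20193981821"
COOKIE = "ASP.NET_SessionId=zstuzktl0x1laoqhxgkm4ign"


def build_request(url, cookie):
    """Return a Request object that carries the Cookie header (hypothetical helper)."""
    # Headers have to be attached before the request is sent; setting
    # attributes on the response returned by urlopen() is too late.
    return urllib.request.Request(url, headers={"Cookie": cookie})


def download_file(url, cookie, fname, chunk_size=1024):
    """Stream the response body into fname (hypothetical helper)."""
    request = build_request(url, cookie)
    with urllib.request.urlopen(request) as handle, open(fname, "wb") as f:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            f.write(chunk)


# Building the request does not touch the network; only download_file() does.
request = build_request(URL, COOKIE)
```

Inside the Downloader thread, download_file(url, COOKIE, fname) would then take the place of the bare urlopen(url) call. Whether the server accepts the session id is a separate matter that this sketch cannot verify.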
Related
I haven't done any Twisted programming in a while, so I'm trying to get back into it for a new project. I'm attempting to set up a Twisted client that takes a list of servers as an argument and, for each server, sends an API GET call and writes the return message to a file. This API GET call should be repeated every 60 seconds.
I've done it successfully with a single server using Twisted's Agent class:
from StringIO import StringIO
from twisted.internet import reactor
from twisted.internet.protocol import Protocol
from twisted.web.client import Agent
from twisted.web.http_headers import Headers
from twisted.internet.defer import Deferred
import datetime
from datetime import timedelta
import time
import optparse  # needed by parse_args() below
count = 1
filename = "test.csv"
class server_response(Protocol):
    def __init__(self, finished):
        print "init server response"
        self.finished = finished
        self.remaining = 1024 * 10
    def dataReceived(self, bytes):
        if self.remaining:
            display = bytes[:self.remaining]
            print 'Some data received:'
            print display
            with open(filename, "a") as myfile:
                myfile.write(display)
            self.remaining -= len(display)
    def connectionLost(self, reason):
        print 'Finished receiving body:', reason.getErrorMessage()
        self.finished.callback(None)
def capture_response(response):
    print "Capturing response"
    finished = Deferred()
    response.deliverBody(server_response(finished))
    print "Done capturing:", finished
    return finished
def responseFail(err):
    print "error" + err
    reactor.stop()
def cl(ignored):
    print "sending req"
    agent = Agent(reactor)
    headers = {
        'authorization': [<snipped>],
        'cache-control': [<snipped>],
        'postman-token': [<snipped>]
    }
    URL = <snipped>
    print URL
    a = agent.request(
        'GET',
        URL,
        Headers(headers),
        None)
    a.addCallback(capture_response)
    reactor.callLater(60, cl, None)
    #a.addBoth(cbShutdown, count)
def cbShutdown(ignored, count):
    print "reactor stop"
    reactor.stop()
def parse_args():
    usage = """usage: %prog [options] [hostname]:port ...
    Run it like this:
    python test.py hostname1:instanceName1 hostname2:instancename2 ...
    """
    parser = optparse.OptionParser(usage)
    _, addresses = parser.parse_args()
    if not addresses:
        print parser.format_help()
        parser.exit()
    def parse_address(addr):
        if ':' not in addr:
            hostName = '127.0.0.1'
            instanceName = addr
        else:
            hostName, instanceName = addr.split(':', 1)
        return hostName, instanceName
    return map(parse_address, addresses)
if __name__ == '__main__':
    d = Deferred()
    d.addCallbacks(cl, responseFail)
    reactor.callWhenRunning(d.callback, None)
    reactor.run()
However, I'm having a tough time figuring out how to have multiple agents sending calls. Currently I rely on the reactor.callLater(60, cl, None) call at the end of cl() to create the call loop. So how do I create multiple call agent protocols (server_response(Protocol)) and keep looping through the GET for each of them once my reactor is started?
Look what the cat dragged in!
So how do I create multiple call agent
Use treq. You rarely want to get tangled up with the Agent class.
This API GET call should be repeated every 60 seconds
Use LoopingCall instead of callLater; in this case it's easier and you'll run into fewer problems later.
import treq
from twisted.internet import task, reactor
filename = 'test.csv'
def writeToFile(content):
    with open(filename, 'ab') as f:
        f.write(content)
def everyMinute(*urls):
    for url in urls:
        d = treq.get(url)
        d.addCallback(treq.content)
        d.addCallback(writeToFile)
#----- Main -----#
sites = [
    'https://www.google.com',
    'https://www.amazon.com',
    'https://www.facebook.com']
repeating = task.LoopingCall(everyMinute, *sites)
repeating.start(60)
reactor.run()
It starts in the everyMinute() function, which runs every 60 seconds. Within that function, each endpoint is queried, and once the contents of the response become available, the treq.content function takes the response and returns the contents. Finally, the contents are written to a file.
PS
Are you scraping or trying to extract something from those sites? If you are, Scrapy might be a good option for you.
Here is an example from the IBM Python threading tutorial (http://www.ibm.com/developerworks/aix/library/au-threadingpython/):
#!/usr/bin/env python
import Queue
import threading
import urllib2
import time
hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
         "http://ibm.com", "http://apple.com"]
queue = Queue.Queue()
class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
    def run(self):
        while True:
            #grabs host from queue
            host = self.queue.get()
            #grabs urls of hosts and prints first 1024 bytes of page
            url = urllib2.urlopen(host)
            print url.read(1024)
            #signals to queue job is done
            self.queue.task_done()
start = time.time()
def main():
    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue)
        t.setDaemon(True)
        t.start()
    #populate queue with data
    for host in hosts:
        queue.put(host)
    #wait on the queue until everything has been processed
    queue.join()
main()
print "Elapsed Time: %s" % (time.time() - start)
The example here works perfectly. I am looking for a slight modification. Here there is a known number of URLs, for example 5, and range(5) is used in the for loop to iterate over the URLs and process them.
What if I want to use only 5 threads to process 1000 URLs? When a thread completes, the completed URL should be removed from the queue and a new URL should be added to the queue, all using the same threads.
I can check:
if self.queue.task_done():
    return host
This is the only way I can check whether the URL was processed successfully. Once it has returned, I should remove the URL from the queue and add a new URL. How can I implement this using a queue?
Thanks,
That code will already do what you describe. If you put 1000 items into the queue instead of 5, they will be processed by those same 5 threads - each one will take an item from the queue, process it, then take a new one as long as there are items left in the queue.
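To make that concrete, here is a small Python 3 sketch (the thread uses Python 2, where the module is named Queue) in which 5 worker threads drain 1000 queued items. The URLs are placeholders and nothing is fetched; the worker function and the handled dictionary are hypothetical names that only count how many items each thread picked up.

```python
import queue
import threading

NUM_WORKERS = 5
NUM_ITEMS = 1000


def worker(jobs, handled, lock):
    """Take items off the queue forever; each get() removes one item."""
    name = threading.current_thread().name
    while True:
        url = jobs.get()  # blocks until an item is available
        try:
            # Placeholder for real work such as fetching url.
            with lock:
                handled[name] = handled.get(name, 0) + 1
        finally:
            jobs.task_done()  # tell the queue this item is finished


def run(num_workers, num_items):
    jobs = queue.Queue()
    handled = {}
    lock = threading.Lock()
    for i in range(num_workers):
        threading.Thread(target=worker, args=(jobs, handled, lock),
                         name="worker-%d" % i, daemon=True).start()
    for n in range(num_items):
        jobs.put("http://example.com/page/%d" % n)  # placeholder URLs
    jobs.join()  # returns once task_done() has been called for every item
    return handled


counts = run(NUM_WORKERS, NUM_ITEMS)
print(counts)
```

Note that task_done() returns None, so it cannot be used in an if test; get() is what removes the item from the queue, and there is no separate removal step.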
The code reads URLs from a file, pushes them onto a queue shared by the threads, and makes a third-party web API call for each one; the results go into a global list.
When I run this program, sometimes it runs to the end and finishes (printing "done"), and sometimes it gets stuck and never finishes.
It seems that if there is an exception ("We failed to reach a server"), the process hangs and never finishes. I believe it is a threading problem.
Can anybody figure out what the issue is? Thank you in advance.
Here is the code:
import threading
import Queue
import hmac
import hashlib
import base64
import urllib2
from urllib2 import Request, urlopen, URLError, HTTPError
import sys
import httplib, urllib, time, random, os
import json
from urlparse import urlparse
import time
#Number of threads
n_thread = 50
#Create queue
queue = Queue.Queue()
domainBlacklistDomain=[]
urlList=[]
def checkBlackList(domain,line):
    testUrl = 'https://test.net'
    apiToken = 'aaaaa'
    secretKey = 'bbbb'
    signature_data = 'GET\n/v1/blacklist/lookup\nurl='+domain+'\n\n\n'
    digest = hmac.new(secretKey, signature_data, hashlib.sha1).digest()
    digest_base64 = base64.encodestring(digest)
    req = urllib2.Request('https://test.net/v1/blacklist/lookup?url='+domain)
    req.add_header('Authorization', 'Test' + apiToken + ':' + digest_base64)
    req.add_header('Connection', 'Keep-Alive')
    try:
        page = urlopen(req)
        length = str(page.info())
        if length.find("Content-Length: 0") != -1:
            url=str(line.strip())
            urlList.append(url)
        else:
            json_data=json.load(page)
            domainBlacklistDomain.append(json_data['url'])
            if int(json_data['score']) >10:
                print json_data['url']
    except HTTPError, e:
        print 'The server couldn\'t fulfill the request.'
    except URLError, e:
        print 'We failed to reach a server.'
class ThreadClass(threading.Thread):
    def __init__(self, queue):
        threading.Thread.__init__(self)
        #Assign thread working with queue
        self.queue = queue
    def run(self):
        while True:
            #Get from queue job
            host = self.queue.get()
            parsed_uri = urlparse(host)
            domain = '{uri.scheme}://{uri.netloc}/'.format(uri=parsed_uri)
            if "\n" in domain:
                domain=domain.replace('\n', '').replace('\r', '')
                if domain not in domainBlacklistDomain:
                    checkBlackList(domain,host)
            else:
                if domain not in domainBlacklistDomain:
                    checkBlackList(domain,host)
            #signals to queue job is done
            self.queue.task_done()
#Create number process
for i in range(n_thread):
    t = ThreadClass(queue)
    t.setDaemon(True)
    #Start thread
    t.start()
#Read file line by line
hostfile = open("result_url.txt","r")
for line in hostfile:
    #Put line to queue
    queue.put(line)
#wait on the queue until everything has been processed
queue.join()
fo=open("final_result.txt","w+b")
for item in urlList:
    fo.write("%s\n" %item)
print "done??"
Without reading your code in detail, the issue is almost certainly to do with trying to establish a connection to a non-responsive IP address. The timeouts on these connections can be lengthy.
Try using the socket.setdefaulttimeout() function to establish a global socket timeout.
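A sketch of that suggestion in Python 3 follows (the question uses Python 2). It also guards against a second possible cause that the source does not confirm: if a worker raises an exception that is not caught, it dies before calling task_done(), and queue.join() then blocks forever. The 10-second timeout is an arbitrary choice, and flaky() is a hypothetical stand-in for the real API call.

```python
import queue
import socket
import threading

# Default timeout, in seconds, for sockets created after this call.
# The value 10 is an arbitrary assumption; tune it to the API in use.
socket.setdefaulttimeout(10)


def worker(jobs, process, errors):
    while True:
        item = jobs.get()
        try:
            process(item)
        except Exception as exc:
            # Record the failure instead of letting the thread die.
            errors.append((item, exc))
        finally:
            # Always runs, so jobs.join() cannot wait on a lost item.
            jobs.task_done()


def flaky(item):
    """Hypothetical stand-in for the web API call; fails on every third item."""
    if item % 3 == 0:
        raise OSError("We failed to reach a server.")


jobs = queue.Queue()
errors = []
for _ in range(4):
    threading.Thread(target=worker, args=(jobs, flaky, errors),
                     daemon=True).start()
for n in range(12):
    jobs.put(n)
jobs.join()  # returns even though some items failed
print(sorted(item for item, _ in errors))
```

In the original script the same effect could be had by wrapping the body of run() in try/finally so that self.queue.task_done() is always reached.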
I'm currently working on a program where multiple threads need to access a single array list. The array functions as a "buffer". One or more threads write into this list, and one or more other threads read and remove from it. My first question is: are arrays in Python thread-safe? If not, what is the standard approach for dealing with this situation?
Try using threading.Lock if there is only one resource.
You should use the Queue library.
Here is a good article explaining threading and queues.
import Queue
import threading
import urllib2
import time
from BeautifulSoup import BeautifulSoup
hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
         "http://ibm.com", "http://apple.com"]
queue = Queue.Queue()
out_queue = Queue.Queue()
class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue, out_queue):
        threading.Thread.__init__(self)
        self.queue = queue
        self.out_queue = out_queue
    def run(self):
        while True:
            #grabs host from queue
            host = self.queue.get()
            #grabs urls of hosts and then grabs chunk of webpage
            url = urllib2.urlopen(host)
            chunk = url.read()
            #place chunk into out queue
            self.out_queue.put(chunk)
            #signals to queue job is done
            self.queue.task_done()
class DatamineThread(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, out_queue):
        threading.Thread.__init__(self)
        self.out_queue = out_queue
    def run(self):
        while True:
            #grabs host from queue
            chunk = self.out_queue.get()
            #parse the chunk
            soup = BeautifulSoup(chunk)
            print soup.findAll(['title'])
            #signals to queue job is done
            self.out_queue.task_done()
start = time.time()
def main():
    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue, out_queue)
        t.setDaemon(True)
        t.start()
    #populate queue with data
    for host in hosts:
        queue.put(host)
    for i in range(5):
        dt = DatamineThread(out_queue)
        dt.setDaemon(True)
        dt.start()
    #wait on the queue until everything has been processed
    queue.join()
    out_queue.join()
main()
print "Elapsed Time: %s" % (time.time() - start)
You need locks, as ATOzTOA mentioned. You create one with
lock = threading.Lock()
and each thread acquires it when it enters a critical section. After finishing the section, the thread releases the lock. The Pythonic way to write this is
with lock:
    do_something(buffer)
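As a fuller illustration, here is a Python 3 sketch of a list used as a shared buffer. It swaps the plain Lock for a threading.Condition, which is a lock plus wait/notify, so that readers can sleep until a writer has added something. The class LockedBuffer and the functions writer and reader are hypothetical names, and the item counts are arbitrary.

```python
import threading


class LockedBuffer:
    """A list used as a buffer, guarded by a Condition (a lock plus wait/notify)."""

    def __init__(self):
        self._items = []
        self._cond = threading.Condition()

    def put(self, item):
        with self._cond:            # acquires the underlying lock
            self._items.append(item)
            self._cond.notify()     # wake one waiting reader, if any

    def take(self):
        with self._cond:
            while not self._items:  # wait() releases the lock while blocked
                self._cond.wait()
            return self._items.pop(0)


def writer(buf, start, count):
    for n in range(start, start + count):
        buf.put(n)


def reader(buf, count, out, out_lock):
    for _ in range(count):
        item = buf.take()
        with out_lock:              # the shared result list is guarded too
            out.append(item)


buf = LockedBuffer()
received, received_lock = [], threading.Lock()
threads = [
    threading.Thread(target=writer, args=(buf, 0, 50)),
    threading.Thread(target=writer, args=(buf, 50, 50)),
    threading.Thread(target=reader, args=(buf, 50, received, received_lock)),
    threading.Thread(target=reader, args=(buf, 50, received, received_lock)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(len(received))
```

The standard library's queue.Queue already packages this pattern, so a hand-written buffer like this is mainly worth it when the list needs operations a queue does not offer.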
I read up about threading in the IBM developer sources and found the following example.
In general I understand what happens here, except for one important thing. The work seems to be done in the run() function. In this example, run() only prints a line and signals to the queue that the job is done.
What if I had to return some processed data? I thought about caching it in a global variable and accessing it later, but this does not seem like the right way to go.
Any advice?
Perhaps I should clarify: my intuition tells me to add return processed_data to run() right after self.queue.task_done(), but I can't figure out where to catch that return value, since it is not obvious to me where run() is called.
#!/usr/bin/env python
import Queue
import threading
import urllib2
import time
hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
         "http://ibm.com", "http://apple.com"]
queue = Queue.Queue()
class ThreadUrl(threading.Thread):
    """Threaded Url Grab"""
    def __init__(self, queue):
        threading.Thread.__init__(self)
        self.queue = queue
    def run(self):
        while True:
            #grabs host from queue
            host = self.queue.get()
            #grabs urls of hosts and prints first 1024 bytes of page
            url = urllib2.urlopen(host)
            print url.read(1024)
            #signals to queue job is done
            self.queue.task_done()
start = time.time()
def main():
    #spawn a pool of threads, and pass them queue instance
    for i in range(5):
        t = ThreadUrl(queue)
        t.setDaemon(True)
        t.start()
    #populate queue with data
    for host in hosts:
        queue.put(host)
    #wait on the queue until everything has been processed
    queue.join()
main()
print "Elapsed Time: %s" % (time.time() - start)
You can't return a value from run, and in any case there is normally more than one item to process in each thread, so you don't want to return at all after processing one value (see the while loop in each thread).
I would either use another queue to return the results:
queue = Queue.Queue()
out_queue = Queue.Queue()
class ThreadUrl(threading.Thread):
    ...
    def run(self):
        while True:
            #grabs host from queue
            host = self.queue.get()
            #grabs urls of hosts and saves first 1024 bytes of page
            url = urllib2.urlopen(host)
            out_queue.put(url.read(1024))
            #signals to queue job is done
            self.queue.task_done()
...
def main():
    ...
    #populate queue with data
    for host in hosts:
        queue.put(host)
    #don't have to wait until everything has been processed if we don't want to
    for _ in range(len(hosts)):
        first_1k = out_queue.get()
        print first_1k
or store the result in the same queue:
class WorkItem(object):
    def __init__(self, host):
        self.host = host
class ThreadUrl(threading.Thread):
    ...
    def run(self):
        while True:
            #grabs host from queue
            work_item = self.queue.get()
            host = work_item.host
            #grabs urls of hosts and saves first 1024 bytes of page
            url = urllib2.urlopen(host)
            work_item.first_1k = url.read(1024)
            #signals to queue job is done
            self.queue.task_done()
...
def main():
    ...
    #populate queue with data
    work_items = [WorkItem(host) for host in hosts]
    for item in work_items:
        queue.put(item)
    #wait on the queue until everything has been processed
    queue.join()
    for item in work_items:
        print item.first_1k
The problem with the output-queue method is that the order in which the threads complete is not deterministic, so the position of a result in the output queue does not necessarily match the position of its input.
In this example, if google.com finishes before yahoo.com, the queue holds the google data before the yahoo data, so results read back by position are attributed to the wrong host.
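One way around the ordering problem, sketched here in Python 3 under the assumption that results should line up with the input list, is to put each item on the queue together with its index and let the worker send the index back with the result. The names ordered_map, worker and fake_fetch are hypothetical, and fake_fetch only imitates a download.

```python
import queue
import threading


def worker(jobs, results, process):
    while True:
        index, item = jobs.get()
        try:
            # Send the index back with the result so order can be restored.
            results.put((index, process(item)))
        finally:
            jobs.task_done()


def ordered_map(process, items, num_workers=5):
    """Run process(item) on worker threads; return results in input order."""
    jobs = queue.Queue()
    results = queue.Queue()
    for _ in range(num_workers):
        threading.Thread(target=worker, args=(jobs, results, process),
                         daemon=True).start()
    for index, item in enumerate(items):
        jobs.put((index, item))
    jobs.join()  # every result has been put() before this returns
    ordered = [None] * len(items)  # None marks an item whose processing failed
    while not results.empty():
        index, value = results.get()
        ordered[index] = value
    return ordered


def fake_fetch(host):
    """Hypothetical stand-in for urlopen(host).read(1024)."""
    return "first 1k of " + host


hosts = ["http://yahoo.com", "http://google.com", "http://amazon.com",
         "http://ibm.com", "http://apple.com"]
pages = ordered_map(fake_fetch, hosts)
```

The WorkItem approach shown earlier in the thread avoids the problem in a similar way, since each result is stored on the object that carries its own host.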