How to check if a remote file exists behind a proxy - python

I'm writing an app that connects to a web server (I am the owner of the server) and sends information provided by the user; the server processes that information and sends the result back to the application. The time needed to process the results depends on the user request (from a few seconds to a few minutes).
I use an infinite loop to check if the file exists (maybe there is a more intelligent approach... maybe I could estimate the maximum time a request could take and avoid using an infinite loop).
The important part of the code looks like this:
import time
import mechanize

br = mechanize.Browser()
br.set_handle_refresh(False)
proxy_values = {'http': 'proxy:1234'}
br.set_proxies(proxy_values)

while True:
    try:
        result = br.open('http://www.example.com/sample.txt').read()
        break
    except Exception:
        # File not there yet (or the request failed); wait and try again.
        pass
    time.sleep(10)
Behind a proxy the loop never ends, but if I change the code to something like this,
time.sleep(200)
result=br.open('http://www.example.com/sample.txt').read()
i.e. I wait long enough to ensure that the file has been created before trying to read it, I indeed get the file :-)
It seems as if, once mechanize asks for a file that does not exist, every time mechanize asks again I get no file...
I replicated the same behaviour using Firefox: I ask for a non-existent file, then I create that file (remember I am the owner of the server...), and I still cannot get the file.
And using mechanize and Firefox I can get deleted files...
I think the problem is related to the proxy cache. I don't think I can delete that cache, but maybe there is some way to tell the proxy I need to recheck whether the file exists...
Any other suggestion to fix this problem?

The simplest solution could be to add an (unused) GET parameter to avoid caching the request, i.e.:
i = 0
while True:
    try:
        result = br.open('http://www.example.com/sample.txt?r=%d' % i).read()
        break
    except Exception:
        i += 1
        time.sleep(10)
The extra parameter should be ignored by the web application.
An HTTP HEAD request is probably the correct way to do this; see this question for an example.
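As a rough sketch of that idea (not using mechanize; this assumes the standard library httplib module, and the host, path and proxy values are placeholders), a HEAD poll that also asks caches not to serve a stale answer might look like this:

import time
import httplib

def remote_file_exists(host, path, proxy_host=None, proxy_port=None):
    # Send a HEAD request with no-cache headers so the proxy revalidates
    # with the origin server instead of replaying its cached 404.
    headers = {'Cache-Control': 'no-cache', 'Pragma': 'no-cache'}
    if proxy_host:
        conn = httplib.HTTPConnection(proxy_host, proxy_port)
        conn.request('HEAD', 'http://%s%s' % (host, path), headers=headers)
    else:
        conn = httplib.HTTPConnection(host)
        conn.request('HEAD', path, headers=headers)
    status = conn.getresponse().status
    conn.close()
    return status == 200

while not remote_file_exists('www.example.com', '/sample.txt',
                             proxy_host='proxy', proxy_port=1234):
    time.sleep(10)
result = br.open('http://www.example.com/sample.txt').read()

Whether the proxy honours the no-cache hint depends on its configuration, so the cache-busting query parameter above remains the more robust option.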

Related

Python 'requests' GET in loop eventually throws [WinError 10048]

Disclaimer: This is similar to some other questions relating to this error, but my program is not using any multi-threading/processing and I'm working with the 'requests' module instead of raw socket commands, so none of the solutions I saw apply to my issue.
I have a basic status-checking program running Python 3.4 on Windows that uses a GET request to pull some data off a status site hosted by a number of servers I have to keep watch over. The core code is set up like this:
import requests
import time

URL_LIST = [some, list, of, the, status, sites]  # https:// sites

session = requests.session()
previous_data = ""

while 1:
    data = ""
    for url in URL_LIST:
        headers = {'X-Auth-Token': Associated_Auth_Token}
        try:
            status = session.get(url, headers=headers).json()['status']
        except ConnectionError:
            status = "SERVER DOWN"
        data += "%s \t%s\n" % (url, status)
    if data != previous_data:
        print(data)
        previous_data = data
    time.sleep(15)
...which typically runs just fine for hours (this script is intended to run 24/7 and has additional logging built in that I left out here for simplicity and relevance), but eventually it crashes and throws the error mentioned in the title:
[WinError 10048] Only one usage of each socket address (protocol/network address/port) is normally permitted
The servers I'm requesting from are notoriously slow at times (and sometimes go down entirely, hence the try/except), so my inclination would be that after looping this over and over, eventually a request has not fully finished before the next request comes through and Windows tries to step on itself. But I don't see how that could happen with my code, since it iterates serially through the URLs.
Also, if this is a TIME_WAIT issue, as some other related posts ran into, I'd rather not have to wait for that to finish, since I'd like to update every 15 seconds or better. So then I considered closing and opening a new requests session every so often, since it typically works fine for hours before hitting a snag, but based on Lukasa's comment here:
To avoid getting sockets in TIME_WAIT, the best thing to do is to use a single Session object at as high a scope as you can and leave it open for the lifetime of your program. Requests will do its best to re-use the sockets as much as possible, which should prevent them lapsing into TIME_WAIT
...it sounds like that is not a good idea, though when he says 'lifetime of your program' he may not intend the statement to include 24/7 use as in my case.
So instead of blindly trying things and waiting some number of hours for the program to crash again so I can see if the error changes, I wanted to consult the wealth of knowledge here first to see if anyone can see what's going wrong and knows how I should fix it.
Thanks!
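For reference, a minimal sketch of the "recycle the session every so often" workaround considered above might look like the following; the recycle interval is an arbitrary assumption and this is purely illustrative, not a recommendation:

import requests
import time

RECYCLE_AFTER = 200  # arbitrary number of polling cycles before starting a fresh session

session = requests.Session()
cycles = 0

while True:
    # ... perform the GET requests with `session` exactly as in the snippet above ...
    cycles += 1
    if cycles >= RECYCLE_AFTER:
        # Close the old connection pool so its sockets are released,
        # then continue with a brand-new session.
        session.close()
        session = requests.Session()
        cycles = 0
    time.sleep(15)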

Issue with sending POST requests using the requests library

import requests

while True:
    try:
        posting = requests.post(url, json=data, headers=headers, timeout=3.05)
    # If a connection error occurs, start from the beginning of the loop
    except requests.exceptions.ConnectionError as e:
        continue
    # If a read timeout occurs, start from the beginning of the loop
    except requests.exceptions.ReadTimeout as e:
        continue
a link to more code : Multiple accidental POST requests in Python
This code uses the requests library to perform POST requests indefinitely. I noticed that when the try fails multiple times and the while loop starts over multiple times, then when I can finally send the POST request, I find multiple entries on the server side for the same second. I was writing to a txt file at the same time and it showed one entry only. Each entry is 5 readings. Is this an issue with the library itself? Is there a way to fix this?! No matter what kind of conditions I put in, it still doesn't work :/ !
You can notice the reading at 12:11:13 has 6 parameters per second, while at 12:14:30 (after the delay; it should be every 10 seconds) there are a few entries in the same second!!! 3 entries that make up 18 readings in one second, instead of only 6!
It looks like the server receives your requests and acts upon them but fails to respond in time (3s is a pretty low timeout; a load spike or paging operation can easily make the server miss it unless it employs special measures). I'd suggest to:
- process requests asynchronously (e.g. spawn threads; Asynchronous Requests with Python requests discusses ways to do this with requests) and not use timeouts (TCP has its own timeouts, let it fail instead);
- reuse the connection(s) (TCP has quite a bit of overhead for establishing and tearing down connections) or use UDP instead;
- include some "hints" (IDs, timestamps etc.) to prevent the server from adding duplicate records, as in the sketch after this list. (I'd call this one a workaround, as the real problem is you're not making sure your request was processed.)
From the server side, you may want to:
- respond ASAP and act upon the info later. Do not let a pending action prevent answering further requests.
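A minimal sketch of the "hints" idea, assuming the server is willing to deduplicate on a client-supplied field (the request_id field name is an invented placeholder; the server-side handling is up to you):

import uuid
import requests

payload = dict(data)                        # the readings you already collect
payload['request_id'] = str(uuid.uuid4())   # hypothetical field the server deduplicates on

while True:
    try:
        # Re-sending the same request_id lets the server drop duplicates when an
        # earlier attempt was actually processed but its response was lost.
        requests.post(url, json=payload, headers=headers, timeout=3.05)
        break
    except (requests.exceptions.ConnectionError, requests.exceptions.ReadTimeout):
        continue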

put_async, wait, and callbacks, how to speed up page redirect after an asynchronous file upload

So I have a BlobstoreUploadHandler class that uses put_async and wait like so:
x = Model.put_async()
x.wait()
then it proceeds to pass some data up front to JavaScript, so that the user is redirected to the class serving their file upload. It does this like so:
redirecthref = '%s/serve/%s' % (
    self.request.host_url, Model.uploadid)
self.response.headers['Content-Type'] = 'application/json'
obj = { 'success' : True, 'redirect': redirecthref }
self.response.write(json.dumps(obj))
This all works well and good; however, it takes a CRAZY amount of time for this redirect to happen (we're talking minutes), and while the file is uploading the page is completely frozen. I've noticed I am able to access the link that the JavaScript would redirect to even while the upload is happening and the page is frozen, so my question is: what strategies can I pursue to make the redirect happen right when the URL becomes available? Is this what the 'callback' parameter of put_async is for, or is this where I want to look into url_fetch?
I'm pretty new to this and any and all help is appreciated. Thanks!
UPDATE:
So I've figured out that the upload is slow for several reasons:
I should be using put() rather than put_async(), which I've found does speed up the upload time; however, something is breaking and it's giving me a 500 error that looks like:
POST http://example.com/_ah/upload/AMmfu6au6zY86nSUjPMzMmUqHuxKmdTw1YSvtf04vXFDs-…tpemOdVfHKwEB30OuXov69ZQ9cXY/ALBNUaYAAAAAU-giHjHTXes0sCaJD55FiZxidjdpFTmX/ 500 (Internal Server Error)
It still uploads both resources, but the redirect does not work. I believe this is happening on the upload_url, which is created using:
upload_url = blobstore.create_upload_url('/upload')
All that aside, even using put() instead of put_async(), the wait() method still takes an exorbitant amount of time.
If I remove the x.wait(), the upload will still happen, but the redirect gives me:
IndexError: list index out of range
This error is thrown on the following line of my /serve class handler:
qry = Model.query(Model.uploadid == param).fetch(1)[0]
So, in short, I believe the fastest way to serve an entity after upload is to take out x.wait() and instead wrap the query in a try: and except:, so that it keeps trying to serve the page until it no longer gets a list index error.
Like I said, I'm pretty new to this, so actually making this happen is a little over my skill level; thus any thoughts or comments are greatly appreciated, and I am always happy to offer more in the way of code or explanation. Thanks!
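A rough sketch of the retry idea described above, purely for illustration (the retry budget and delay are arbitrary assumptions):

import time

entity = None
for attempt in range(10):      # arbitrary retry budget
    results = Model.query(Model.uploadid == param).fetch(1)
    if results:                # avoids the IndexError while the entity is not yet visible
        entity = results[0]
        break
    time.sleep(0.5)            # arbitrary pause before re-querying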
async calls are about sending something to the background when you don't REALLY care about when it finishes. Seems to me you are looking for a put.
By definition a put_async isn't meant to finish fast. It sends something to the back for when your instance has time to do it. You're looking for a put I think. It'll freeze your application the same way your wait is doing, but instead of waiting a LONG time for the async to finish, it'll start working on it right away.
As said in the async documentation (https://developers.google.com/appengine/docs/java/datastore/async):
However, if your application needs the result of the get() plus the result of a Query to render the response, and if the get() and the Query don't have any data dependencies, then waiting until the get() completes to initiate the Query is a waste of time.
Doesn't seem to be what you're doing. You're using an async call in a purely synced way. It WILL take longer to complete than a simple put. Unless there is some reason to push the "put" to take longer, you shouldn't use async
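For illustration only, and assuming the ndb client library (which matches the put_async()/wait() calls above), the two patterns differ roughly like this:

from google.appengine.ext import ndb

class Model(ndb.Model):
    uploadid = ndb.StringProperty()

entity = Model(uploadid='abc123')

# Synchronous: blocks right here until the datastore write has finished.
key = entity.put()

# Asynchronous: returns a future immediately; the write completes in the
# background and you block only when you ask for the result.
future = entity.put_async()
key = future.get_result()   # roughly future.wait() followed by reading the result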
Looking back, I wanted to circle back on this, since I solved it shortly after posting. What I discovered was that there was no real way to speed up the upload, other than using put instead of put_async, of course.
But there was a tricky way to access the blob in my redirect URL other than through the Model.uploadid, which was not guaranteed to be consistently uploaded by the time the redirect occurred.
The solution was to simply access the blob using the .key() method of my upload object and to pass that into the redirecthref, instead of the Model.uploadid:
redirecthref = '%s/serve/%s' % (self.request.host_url, self.get_uploads('my_upload_object')[0].key())
Not sure why the .key() lookup seemed to bypass the whole upload process, but this seemed to work for me.
Thanks,

Python on the web: executing code as it's processed?

I made a Python application that I'd like to deploy to the web. I'm on a Mac, so I enabled the web server and dropped it in my cgi-bin, and it works fine. The problem is, the application does some intensive computations, and I would really like to let the user know what's going on while it's executing.
Even though I have print statements scattered throughout the code, it doesn't output anything to my browser until the entire thing is done executing. Is there any way I can fix this so output shows up as the code executes?
Instead of 'print', you might want to try
sys.stdout.write('something something something')
sys.stdout.flush()
That'll ensure that the web server isn't waiting for a buffer to fill up.
If sys.stdout.flush() didn't do the trick, the problem is likely to be resolved by chunked transfer encoding.
To give a little bit of background, chunked encoding defines a mechanism where the server tells the client up front 'my data stream has no predetermined length', and as an efficiency the data is transferred in chunks as opposed to just streaming content willy-nilly.
Here's a simple example; the important part is how you send the data and which headers you use.
Another aspect of this is what the browser actually does with the data as it comes in: even if your CGI is sending data to the browser, it might just sit on it until it's done.
With the following example, curl shows each 'chunk' being downloaded correctly in a stream, while Safari still hangs waiting for the CGI to complete.
#!/usr/bin/python
import time
import sys

def chunk(msg=""):
    # Each chunk is prefixed with its length in hex; the leading CRLF
    # terminates the previous chunk (or the header block for the first one).
    return "\r\n%X\r\n%s" % (len(msg), msg)

sys.stdout.write("Transfer-Encoding: chunked\r\n")
sys.stdout.write("Content-Type: text/html\r\n")

for i in range(0, 1000):
    time.sleep(.1)
    sys.stdout.write(chunk("%s\n" % ('a' * 80)))
    sys.stdout.flush()

# A zero-length chunk marks the end of the stream.
sys.stdout.write(chunk() + '\r\n')
So if you just connect to this CGI with your browser, yeah, you won't see any changes; however, if you use AJAX techniques and set up a handler that fires every time data arrives, you'll be able to 'stream' it as it comes in.
Probably the best approach to something like this is to separate your concerns. Make an AJAX-driven "console" type display that, for instance, polls a log file which is written to by the worker process, as sketched below.
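A minimal sketch of that polling idea, assuming the worker appends progress lines to a known log file (the script name, log path and offset parameter are invented placeholders):

#!/usr/bin/python
# progress.py: tiny CGI endpoint the page can poll with AJAX every second or two.
import cgi
import os
import sys

LOG_PATH = '/tmp/worker_progress.log'      # hypothetical file the worker appends to

form = cgi.FieldStorage()
offset = int(form.getfirst('offset', 0))   # bytes the client has already seen

body = ''
if os.path.exists(LOG_PATH):
    with open(LOG_PATH) as log:
        log.seek(offset)
        body = log.read()

sys.stdout.write("Content-Type: text/plain\r\n\r\n")
sys.stdout.write(body)

The page's JavaScript would request progress.py?offset=N repeatedly, append whatever comes back to the on-screen console, and advance its offset.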

Detecting timeout errors in Python's urllib2 urlopen

I'm still relatively new to Python, so if this is an obvious question, I apologize.
My question is in regard to the urllib2 library and its urlopen function. Currently I'm using this to load a large number of pages from another server (they are all on the same remote host), but the script is killed every now and then by a timeout error (I assume this is from the large requests).
Is there a way to keep the script running after a timeout? I'd like to be able to fetch all of the pages, so I want a script that will keep trying until it gets a page, and then move on.
On a side note, would keeping the connection open to the server help?
Next time the error occurs, take note of the error message. The last line will tell you the type of exception. For example, it might be a urllib2.HTTPError. Once you know the type of exception raised, you can catch it in a try...except block. For example:
import urllib2
import time

for url in urls:
    while True:
        try:
            sock = urllib2.urlopen(url)
        except (urllib2.HTTPError, urllib2.URLError) as err:
            # You may want to count how many times you reach here and
            # do something smarter if you fail too many times.
            # If a site is down, pestering it every 10 seconds may not
            # be very fruitful or polite.
            time.sleep(10)
        else:
            # Success
            contents = sock.read()
            # process contents
            break  # break out of the while loop
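On the timeout question specifically: urlopen also accepts an explicit timeout argument, and a timeout during the read can surface as socket.timeout rather than URLError, so a sketch covering both might look like this (the 30-second value is an arbitrary choice):

import socket
import urllib2

try:
    sock = urllib2.urlopen(url, timeout=30)   # fail fast instead of hanging indefinitely
    contents = sock.read()
except socket.timeout:
    # Timed out while reading the response body.
    contents = None
except urllib2.URLError as err:
    # Connection-level failures; a timeout during connect typically shows up
    # here with err.reason being a socket.timeout instance.
    contents = None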
The missing manual of urllib2 might help you
