Fetch a large chunk of data with TaskQueue - python

I'd like to fetch a large file from a URL, but it always raises a DeadlineExceededError, even though I run the fetch from a TaskQueue task and pass deadline=600 to fetch.
The problem comes from the fetch itself, so Backends cannot help here: even if I launched a backend from a TaskQueue, I'd have 24h to return, but the 10-minute limit on the fetch would still apply, right?
Is there a way to fetch from a particular offset of the file up to another offset? Then I could split the fetch into parts and put them all back together afterwards.
Any ideas?
Actually the file to fetch is not really large: between 15 and 30 MB, but the server is likely very slow and constantly hammered with requests...

If the server supports it, you can supply the HTTP Range header to specify a subset of the file that you want to fetch. If the content is being served statically, the server will probably respect range requests; if it's dynamic, it depends on whether the author of the code that generates the response allowed for them.
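For example, on App Engine you could pull the file down in ranges from your task; here's a minimal sketch, assuming the remote server honours Range requests (the 1 MB chunk size is just a placeholder):

from google.appengine.api import urlfetch

CHUNK = 1024 * 1024  # 1 MB per request; adjust to taste

def fetch_in_chunks(url):
    parts = []
    offset = 0
    while True:
        headers = {'Range': 'bytes=%d-%d' % (offset, offset + CHUNK - 1)}
        result = urlfetch.fetch(url, headers=headers, deadline=600)
        if result.status_code == 416:
            break  # requested range starts past the end of the file
        # 206 means the server returned the requested range; a plain 200 means
        # it ignored the Range header and sent the whole file in one go.
        if result.status_code not in (200, 206):
            raise Exception('fetch failed with status %d' % result.status_code)
        parts.append(result.content)
        if result.status_code == 200 or len(result.content) < CHUNK:
            break  # whole file received, or this was the last chunk
        offset += CHUNK
    return ''.join(parts)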

Related

How do you tell if an ArcGIS request is processed correctly?

My company has an ArcGIS server, and I've been trying to geocode some addresses using the Python requests package.
However, as long as the input format is correct, response.status_code is always 200, meaning everything is OK, even if the server didn't process the request properly.
(For example, if the batch size limit is 1000 records and I send a JSON input with 2000 records, it still returns status code 200, but half of the records get ignored.)
Is there a way for me to know whether the server processed the request properly or not?
A great spot to start is the server logs. They are located in your ArcGIS Server Manager (https://gisserver.domain.com:6443/arcgis/manager). I would assume it logs some kind of warning/info there when records are ignored, but since that is not technically an error, no error message would be returned anywhere.
I doubt you'd want to do this, but if you want to raise your limit you can follow this technical article on how to do that: https://support.esri.com/en/technical-article/000012383
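As an additional client-side check, you can compare the number of records you sent with the number that actually came back in the JSON body, since ArcGIS typically reports problems inside the body rather than via the HTTP status. A rough sketch; the payload layout and the 'locations'/'error' keys are assumptions based on a geocodeAddresses-style endpoint, so adjust them to your service's actual response:

import json

import requests

def geocode_batch(url, records):
    payload = {'addresses': json.dumps({'records': records}), 'f': 'json'}
    r = requests.post(url, data=payload)
    r.raise_for_status()                      # catches transport-level problems only
    body = r.json()
    if 'error' in body:                       # server-side failure reported in the body
        raise RuntimeError('server error: %s' % body['error'])
    locations = body.get('locations', [])
    if len(locations) != len(records):
        raise RuntimeError('sent %d records but only %d came back'
                           % (len(records), len(locations)))
    return locations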

Python requests caching authentication headers

I have used Python's requests module to make a POST call (within a loop) to a single URL, with a varying set of data in each iteration. I already reuse a session so that the same underlying TCP connection is used for each call during the loop's iterations.
However, I want to further speed up my processing by 1. caching the URL and the authentication values (user ID and password), as they remain the same in each call, and 2. spawning multiple sub-processes, each handling a group of the calls, so that these smaller groups can be processed in parallel.
Please note that I pass my authentication as headers in base64 format, and pseudo code of my POST call would typically look like this:
import requests

s = requests.Session()
url = 'https://example.net/'
headers = {'authorization': authbase64string, 'other': 'headers'}

for data in data_records:  # loop through data records
    # POST call; the session reuses the underlying TCP connection
    r = s.post(url, data=data, headers=headers)
    response = r.json()
# end of loop and program
Please review the scenario and suggest any techniques/tips which might be of help.
Thanks in advance,
Abhishek
You can:
do it as you described (if you want to make it faster then you can run it using multiprocessing) and, e.g., add the headers to the session rather than to each request (see the sketch after this list),
modify the target server so it accepts one POST request with multiple data records (that way you limit the time spent on connecting, etc.),
do some optimizations on the server side so that it replies faster (or just have it store the requests and send you the response later via some callback).
It would be much easier if you described the use case :)
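A minimal sketch of the first suggestion (headers set once on the session, work split across a multiprocessing pool); the URL, credentials and payloads are placeholders:

import base64
from multiprocessing import Pool

import requests

URL = 'https://example.net/'                                   # placeholder
AUTH = 'Basic ' + base64.b64encode(b'user:password').decode()  # placeholder credentials

def post_chunk(chunk):
    # Each worker gets its own session; the auth header is attached once,
    # not on every individual request.
    s = requests.Session()
    s.headers.update({'Authorization': AUTH})
    return [s.post(URL, data=item).json() for item in chunk]

if __name__ == '__main__':
    records = ['payload-1', 'payload-2', 'payload-3', 'payload-4']  # example data
    chunks = [records[i::4] for i in range(4)]   # split the work into 4 groups
    with Pool(4) as pool:
        results = pool.map(post_chunk, chunks)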

Checking if a URL exists and is smaller than x bytes without consuming full response

I have a use case where I want to check (from within a python/Django project) if a response to a GET request is smaller than x bytes, if the whole response completes within y seconds and if the response status is 200. The URL being tested is submitted by end users.
Some constraints:
A HEAD request is not acceptable, simply because some servers might not include a Content-Length, might lie about it, or might simply block HEAD requests.
I would not like to consume the full GET response body. Imagine an end user submitting a URL to a 10 GB file... all my server's bandwidth (and memory) would be consumed by it.
tl;dr: Is there any Python HTTP API that:
accepts a timeout for the whole transaction (I think httplib2 does this),
checks that the response status is 200 (all HTTP libraries do this),
kills the request (perhaps with RST) once x bytes have been received, to avoid bandwidth starvation?
The x here would probably be in the order of KBs; y would be a few seconds.
You could open the URL in urllib and read(x+1) from the returned object. If the length of the returned string is x+1, then the resource is larger than x. Then call close() on the object to close the connection, i.e. kill the request. In the worst case, this will fill the OS's TCP buffer, which is something you cannot avoid anyway; usually, this should not fetch more than a few kB beyond x.
If you furthermore add a Range header to the request, sane servers will close the connection themselves after x+1 bytes. Note that this changes the reply code to 206 Partial Content, or 416 Requested range not satisfiable if the file is too small. Servers which do not support this will ignore the header, so this should be a safe measure.
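A minimal sketch of that approach using urllib.request; note that urlopen's timeout covers blocking socket operations rather than the whole transaction, so a strict y-second budget for the entire transfer would need extra handling:

import urllib.error
import urllib.request

def small_enough(url, max_bytes, timeout):
    # Ask only for bytes 0..max_bytes so a compliant server stops sending early
    req = urllib.request.Request(url, headers={'Range': 'bytes=0-%d' % max_bytes})
    try:
        resp = urllib.request.urlopen(req, timeout=timeout)
    except urllib.error.HTTPError as e:
        return e.code == 416   # range not satisfiable: the resource is tiny/empty
    try:
        if resp.status not in (200, 206):
            return False
        body = resp.read(max_bytes + 1)
        # Reading x+1 bytes means the resource is larger than x
        return len(body) <= max_bytes
    finally:
        resp.close()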

HEAD request vs. GET request

I always had the idea that doing a HEAD request instead of a GET request was faster (no matter the size of the resource) and therefore had its advantages in certain solutions.
However, while making a HEAD request in Python (to a 5+ MB dynamically generated resource) I realized that it took the same time as making a GET request (almost 27 seconds instead of the 'less than 2 seconds' I was hoping for).
I used some urllib2 solutions found here to make a HEAD request, and even used pycurl (setting headers and NOBODY to True). Both of them took the same amount of time.
Am I missing something conceptually? Is it possible, using Python, to do a 'quick' HEAD request?
The server is taking the bulk of the time, not your requester or the network. If it's a dynamic resource, it's likely that the server doesn't know all the header information - in particular, Content-Length - until it's built it. So it has to build the whole thing whether you're doing HEAD or GET.
The response time is dominated by the server, not by your request. The HEAD request returns less data (just the headers), so conceptually it should be faster, but in practice many static resources are cached, so there is almost no measurable difference (just the time for the additional packets to come down the wire).
Chances are, the bulk of that request time is actually whatever process generates the 5+MB response on the server rather than the time to transfer it to you.
In many cases, a web application will still execute the full script when responding to a HEAD request--it just won't send the full body back to the requester.
If you have access to the code that is processing that request, you may be able to add a condition in there to make it handle the request differently depending on the method, which could speed it up dramatically.
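An easy way to confirm this is to time both methods against the same resource; a small sketch with requests (the URL is a placeholder):

import time

import requests

url = 'https://example.net/big-dynamic-resource'  # placeholder

for method in ('HEAD', 'GET'):
    start = time.time()
    r = requests.request(method, url)
    elapsed = time.time() - start
    # If both timings are similar, the server is spending the time generating the
    # response either way; HEAD only saves the transfer of the body.
    print(method, r.status_code, len(r.content), '%.1f s' % elapsed)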

SOAP Method Max Item Number

I was wondering if there is a maximum limit to the number of items that can be received through a SOAP method, or if the server I'm communicating with just has a strange limit.
Using Python's Suds framework, I called a method named getRecords against a database of about 39,000 rows. Unfortunately, when I actually get the results, I only get a list of about 250 items, even though the data from every row is necessary for the system to work. I was just curious whether I am being limited by something in SOAP itself.
Thanks!
SOAP itself imposes no such limit. The cap is set on the server side so that big queries don't bog the server down.
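If the service exposes paging parameters, you can usually work around the server-side cap by requesting the rows in batches; a sketch with Suds, where the offset/limit arguments on getRecords are hypothetical (check the WSDL for the real parameter names):

from suds.client import Client

client = Client('https://example.net/service?wsdl')  # placeholder WSDL URL

def get_all_records(batch_size=250):
    records = []
    offset = 0
    while True:
        # 'offset' and 'limit' are hypothetical; use whatever paging
        # arguments the WSDL actually defines.
        batch = client.service.getRecords(offset=offset, limit=batch_size)
        if not batch:
            break
        records.extend(batch)
        if len(batch) < batch_size:
            break   # last (partial) page
        offset += batch_size
    return records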
