The task is:
1) Send an HTTP GET to a URL based on a parameter.
2) Modify the response based on the same parameter.
3) Send an HTTP POST to a URL based on the same parameter.
I am currently doing this with the requests library, but making the requests one by one is slow, and there can be up to 20,000 of them.
I tried multiprocessing, but for some reason it hangs after sending 5,000-10,000 GETs and POSTs.
I read about grequests, but its documentation says:
"Order of these responses does not map to the order of the requests you send out." I need the order, since I have to modify each response based on the GET I sent.
What is the best option here? I have read about threading and Tornado too, but having messed up my first approach with multiprocessing, I want to be sure before diving in again.
Here is a solution that lets you use grequests' imap (which is theoretically faster than grequests' map) and still know an index that maps each response back to its request. Credit to a question asked on the project's GitHub issues.
import grequests
from functools import partial

def callback(index, response, **kwargs):
    # Tag the response with the index of the request that produced it.
    # Returning None keeps the original (now tagged) response object.
    response.image_index = index

rs = [
    grequests.get(
        url,
        callback=partial(callback, index)
    )
    for index, url in enumerate(urls)  # urls is your ordered list of targets
]
You should be able to tailor this to suit your needs.
EDIT:
I used this successfully with hooks:
grequests.get(
    url,
    hooks={'response': partial(process_response, index)}
)
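Putting it together for your GET, modify, POST flow, here is a minimal sketch. It is an illustration, not a drop-in solution: the urls list and the modify() helper are hypothetical placeholders for your own parameter-driven logic.

import grequests
from functools import partial

urls = ['http://example.com/item/%d' % i for i in range(20)]  # placeholder URLs

def tag_response(index, response, **kwargs):
    # imap yields responses in completion order, not submission order,
    # so record which request produced each response.
    response.request_index = index

gets = (
    grequests.get(url, hooks={'response': partial(tag_response, index)})
    for index, url in enumerate(urls)
)

# Collect the responses back into submission order; size bounds concurrency.
ordered = [None] * len(urls)
for response in grequests.imap(gets, size=20):
    ordered[response.request_index] = response

# Steps 2 and 3: modify each response and POST it back.
# modify() is your step-2 logic; the POST target is a placeholder too.
posts = (
    grequests.post(urls[i], data=modify(ordered[i].text))
    for i in range(len(urls))
)
for response in grequests.imap(posts, size=20):
    pass  # inspect response.status_code here if needed

Keeping size at a modest value (10-20) also bounds open connections, which should help avoid the resource exhaustion that tends to make a 20,000-request multiprocessing run hang.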
I have a REST API backed by a database table with two columns, product_id and server_id; the API serves product_ids to specific servers, which request the data based on their server_id.
Let's say I have three servers with server_ids 1, 2 and 3.
My design is like this: a GET request to /products/server_id/1 returns a JSON list of product_ids with server_id = 1; similarly, /products/server_id/2 returns the list for server_id = 2.
Should I remove these routes and instead require a POST request to the /products route alone, with instructions in the payload for which server_id's product_ids to return?
For example, sending the payload {"server_id": 1} would yield the list of product_ids for server_id = 1.
Should I remove these routes and instead require a POST request to the /products route alone, with instructions in the payload for which server_id's product_ids to return?
Not usually, no.
GET communicates to general purpose components that the semantics of the request message are effectively read only (see "safe"). That affordance alone makes a number of things possible; for instance, spiders can crawl and index your API, just as they would for a web site. User agents can "pre-fetch" resources, and so on.
All of that goes right out the window when you decide to use POST.
Furthermore, the URI itself serves a number of useful purposes - caches use the URI as the primary key for matching a request. We can therefore reduce the load on the origin server by re-using representations that have been stored under a specific identifier. We can also perform magic like sticking that URI into an email message, without the context of any specific HTTP request, and the receiver of the message will be able to GET that identifier and fetch the resource we intend.
Again, we lose all of that when the identifying information is in the request payload, rather than in the identifier metadata where it belongs.
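To make the trade-off concrete, here is a sketch of the two styles using requests (the host api.example.com and the routes are hypothetical):

import requests

# Identifier in the URI: safe, cacheable, bookmarkable, crawlable.
r = requests.get('https://api.example.com/products/server_id/1')

# Identifier in the payload: the general-purpose HTTP machinery no longer applies.
r = requests.post('https://api.example.com/products', json={'server_id': 1})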
That said, we sometimes do use the payload for identifying information as a workaround: for example, if we need so much identifying information that we start seeing 414 URI Too Long responses, then we may need to change our interaction protocol to use a POST request with the identifying information in the payload (losing, as above, the advantages of using GET).
An online example of this might be something like an HTML validator, that accepts a candidate document and returns a representation of the problems found. That's effectively a read only action, but in the general case an HTML document is too long to comfortably fit in the target-uri of an HTTP request.
So we punt.
In a hypermedia API, like those used on the World Wide Web, we can get away with it, because the HTTP method to use is provided by the server as part of the metadata of the form itself. You as the client don't need to know the server's preferred semantics; you just need to know how to process the form data.
For instance, as I type this answer into my browser, I don't need to know what the target URI is, or what HTTP method is going to be used, because the browser already knows what to do (based on the HTML and whatever scripts are running "on demand").
In REST APIs, POST requests should only be used to create a new resource, so to retrieve data from a server, the best practice is to perform a GET request.
If you want to load products 1, 2, 4 and 8 on server 9, for example, you can use this kind of request:
GET https://website/servers/9/products/1,2,4,8
On the server side, if the products value contains a comma-separated list, return an array with all results; if not, return an array with just one item, to keep consistency between calls.
In case you need to get all products, you can keep only the following URL:
GET https://website/servers/9/products
As no id is provided in the products parameter, the server should return all existing products for the requested server.
Note: large result sets must be paginated.
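On the client side, either call is then a one-liner with requests (a sketch assuming the hostname and JSON responses described above):

import requests

# Fetch products 1, 2, 4 and 8 for server 9.
products = requests.get('https://website/servers/9/products/1,2,4,8').json()

# Fetch all products for server 9 (paginated by the server).
all_products = requests.get('https://website/servers/9/products').json()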
Short version: Can I just use the Requests module for POST, GET, and DELETE?
I'm trying to use the Pinterest REST API. (Pinterest API Explorer)
I'm going the simple route and manually got my authentication token via OAuth, so basically all I need to know is how to POST, GET, and DELETE to a specific URL, include the parameters, and get back JSON.
I really only need three API functions, list authorized user's followers (GET), follow user (POST), and unfollow user (DELETE). The only param I need for any of those is my access_token that I got manually.
It seems like a simple problem, but there are about five Python Pinterest API wrappers, none of them complete and some not working at all. I've looked at the pycurl, httplib, and requests modules. They all have a simple enough method for GET, but it gets more complicated with POST and maybe DELETE. It seems like it should be super simple: a function that takes a method (POST/GET/DELETE/etc.), a URL, and a set of parameters, then returns the JSON. If it were that easy, I don't understand why all these API wrappers would be half done, since every function in the API should theoretically reduce to one call with those three arguments.
In urllib3 (the package Requests is built on), there's this function under the RequestMethods class:
def request(self, method, url, fields=None, headers=None, **urlopen_kw)
Looks like I understand everything except what the headers are and the **urlopen_kw, but I think it should work without those two variables, correct?
I'd appreciate it if someone could point me in the right direction.
From the docs:
Here is an example of doing a PUT request using urllib.request.Request:
import urllib.request

DATA = b'some data'
req = urllib.request.Request(url='http://localhost:8080', data=DATA, method='PUT')
with urllib.request.urlopen(req) as f:
    pass
print(f.status)
print(f.reason)
In your case the method would be 'POST', 'DELETE' or whatever you like.
If you want to make more complex requests, have a look at this guide for the httplib2 library - it's worth reading.
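To answer the short version directly: yes, requests alone covers all three verbs. Here is a sketch of your three calls; the endpoint paths below are my best guess at the v1 API and should be verified against the API explorer:

import requests

BASE = 'https://api.pinterest.com/v1'
params = {'access_token': 'your-token-here'}  # the token you obtained manually

# GET: list the authorized user's followers.
followers = requests.get(BASE + '/me/followers/', params=params).json()

# POST: follow a user (form fields go in the data argument).
requests.post(BASE + '/me/following/users/', params=params,
              data={'user': 'some_username'})

# DELETE: unfollow that user.
requests.delete(BASE + '/me/following/users/some_username/', params=params)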
I want to make a JSON request with the Python library requests where I only obtain certain JSON objects.
I know that it is really easy to process the obtained JSON object to focus only on the needed information, but that would hurt the request's efficiency (in case it is done repeatedly).
As said, I know this is a possibility:
import requests

url = 'http://www.foo.com'  # requests needs the scheme in the URL
r = requests.get(url).json()
# Do something with r[3]['data4'], the only element that will be used.
But how could I obtain only r[3]['data4'] directly from the request?
Short Answer
To answer your question: no, you can't. To understand why, you need to know what is happening behind the scenes.
Behind the scenes
When you make a request such as r = requests.get('http://www.foo.bar'), you send a request to the server, and r.json() parses the full body the server sent back. This means you cannot get just r[3]['data'] unless the server only sends r[3]['data']; you are always parsing whatever the server returns. It may be possible to filter out everything else during response processing, but I am unaware of how to do it.
You can't, if the server does not allow it. If the target server lets you specify which fields you want, you can send that field list in your request and the server will return only those fields in the JSON. Otherwise you will have to parse the full JSON response and pick out your desired fields.
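For example, some APIs implement a field-selection query parameter; the fields name below is an assumption about the server, not a requests feature, and local parsing remains the fallback:

import requests

url = 'https://www.foo.com/api'  # placeholder

# Works only if the server itself implements field selection.
r = requests.get(url, params={'fields': 'data4'})

# Otherwise: fetch everything and pick out what you need locally.
data4 = requests.get(url).json()[3]['data4']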
I notice the Python library for Instagram and the API are a bit different.
The API request from documentation:
/tags/tag-name/media/recent
Found at: https://instagram.com/developer/endpoints/tags/
That page says 3 parameters:
COUNT: Count of tagged media to return.
MIN_TAG_ID: Return media before this min_tag_id.
MAX_TAG_ID: Return media after this max_tag_id.
My code:
recent_media, next_ = api.tag_recent_media(50, 10000, "cars")
The library documentation: https://github.com/Instagram/python-instagram
Documented syntax:
api.tag_recent_media(count, max_tag_id, tag_name)
The problems:
Regardless of how big a count I pass, it always returns 33. I see nothing about 33 on the limits page: https://instagram.com/developer/limits/
Regardless of the max_tag_id I use, it has no effect.
While the Python library was only marked as unmaintained fairly recently, most of the commits leading up to the date of your question were either trivial or only touched the examples/README.
Yes, despite the lack of documentation, the count parameter appears to be capped at 33. Apparently, some Instagram API endpoints used to accept -1 and return all relevant media, though this was removed/corrected quite a while ago.
Also, thanks to the new algorithmic timeline (which is indeed applied to this endpoint; I was using it for work), you are likely to receive results that have seen recent activity but are years old. I've gotten results from 2013 in my first batch of 33 for at least one tag.
The only parameter I've found useful on that endpoint is count, though as you've seen, you can only get a minor bump in returned data.
I suggest just sticking with the requests library for accessing IG's API from Python. For pagination, you can use the next_url value in the pagination dictionary (it's not present if there's no next page), without having to pass any extra parameters or attempting to deal with the obnoxiously obtuse *_tag_id parameters.
Basic API usage with requests:
import requests
response = requests.get(
    url='https://api.instagram.com/v1/tags/{tagname}/media/recent'.format(tagname='testing'),
    params={
        'access_token': YOUR_TOKEN,
        'count': 33,
    }
).json()
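For pagination, here is a sketch that keeps following next_url until it disappears (assuming the v1 response shape with data and pagination keys, as described above):

import requests

url = 'https://api.instagram.com/v1/tags/testing/media/recent'
params = {'access_token': YOUR_TOKEN, 'count': 33}

media = []
while url:
    body = requests.get(url, params=params).json()
    media.extend(body.get('data', []))
    # next_url already embeds the token and cursor, so drop the params.
    url = body.get('pagination', {}).get('next_url')
    params = None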
The variable cherrypy.request.params, as described in the API, contains the query-string and POST variables in a single dictionary. However, combing over it, it seems to contain every variable received: the GET data parsed from the full request URI merged with the POST data.
There seems to be no way to tell the two apart, or perhaps I am wrong.
Can someone please enlighten me as to how to use purely posted data and ignore any data in the query string beyond the request URI? And yes, I am aware I can find out whether it was a POST or GET request, but this does not stop forgery in POST requests to URIs that also carry GET data.
> POST http://localhost:8080/testURL/part2?test=1
> body: username=test
cherrypy.request.params then has 2 variables:
test = 1
username = test
The docs aren't very clear on this point, but starting in CherryPy 3.2 you can reference request.body.params to obtain just the POST/PUT params. In versions before 3.2, try request.body_params. See http://docs.cherrypy.org/dev/refman/_cprequest.html#cherrypy._cprequest.Request.body_params
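A minimal sketch of a handler that trusts only the body (the class, mount point, and handler name are illustrative, and it assumes CherryPy 3.2 or later):

import cherrypy

class Root(object):
    @cherrypy.expose
    def testURL(self, *path, **kwargs):
        # kwargs mirrors cherrypy.request.params: ?test=1 merged with the body.
        # body.params contains only what was actually posted.
        posted = cherrypy.request.body.params
        return 'username=%r' % posted.get('username')  # 'test' in your example

cherrypy.quickstart(Root())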