I'm looking to switch existing PHP code over to Python using the Requests library. The PHP code sends thousands of GET requests to an API to get needed data. The API limits GET requests to one every 6 seconds per IP. We have numerous IP addresses in order to pull faster. The faster the better in this case.
My question is: is there a way to send the GET requests from different IP addresses using the Requests library? I'm also open to using different libraries in Python, or different methods, to switch between the IP addresses.
The current code makes use of curl_multi_exec with the CURLOPT_INTERFACE setting.
As far as code goes, I don't necessarily need code examples; I'm looking more for a direction or an option that would allow this in Python. I would prefer not to post code, but if it's necessary, let me know.
Thanks!
I don't believe Requests supports setting the outbound interface.
There is a Python binding for cURL (PycURL), though, and it exposes the same interface option.
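For example, a minimal PycURL sketch that binds to a specific local IP via the INTERFACE option (the counterpart of the CURLOPT_INTERFACE setting in your PHP code) might look like this; the URL and local address below are placeholders:

import pycurl
from io import BytesIO

buffer = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, "https://api.example.com/data")  # placeholder endpoint
c.setopt(pycurl.INTERFACE, "192.0.2.10")              # bind to one of your local IPs
c.setopt(pycurl.WRITEDATA, buffer)
c.perform()
print(c.getinfo(pycurl.RESPONSE_CODE), len(buffer.getvalue()), "bytes")
c.close()

You would run one worker per local IP and keep each worker's rate at or below one request every 6 seconds to stay within the API's limit.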
Is there a way to receive and process packets intercepted by HTTP Toolkit programmatically using Python?
Is there any internal API I can access?
Ideally I would like to receive the packets in a JSON or HAR format.
Within HTTP Toolkit itself, this isn't possible right now, but it is planned in future. You can +1 on the issue to vote for it here: https://github.com/httptoolkit/httptoolkit/issues/37. With that, you'd be able to add your own scripts within HTTP Toolkit which could process or store packets elsewhere any way you like, including sending them to a Python process.
In the meantime, this may be possible using Mockttp. Mockttp is the internals of HTTP Toolkit as an open-source JavaScript library that you can use to build your own fully scriptable proxy, and once that's working you can easily add logic to forward packets to Python on top of that. There's a getting started guide here: https://httptoolkit.tech/blog/javascript-mitm-proxy-mockttp/.
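If you go that route, the Python side can be a small HTTP endpoint that accepts whatever JSON your Mockttp rules forward to it. This is only a sketch under that assumption; the /capture route and the payload shape are illustrative, not part of Mockttp or HTTP Toolkit:

from flask import Flask, request

app = Flask(__name__)
captured = []

@app.route("/capture", methods=["POST"])
def capture():
    entry = request.get_json(force=True)  # one captured request/response per POST
    captured.append(entry)
    return {"stored": len(captured)}

if __name__ == "__main__":
    app.run(port=8000)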
I was making some test HTTP requests using Python's requests library. When searching for Walmart's Canadian site (www.walmart.ca), I got this:
How do servers like Walmart's detect that my request is being made programmatically? I understand browsers send all sorts of metadata to the server. I was hoping to get a few specific examples of how this is commonly done. I've found a similar question (albeit related to Selenium WebDriver) here, which claims that there are some vendors that provide this service, but I was hoping for something a bit more specific.
Appreciate any insights, thanks.
As mentioned in the comments, a real browser sends many different values: headers, cookies, data. It fetches from the server not only HTML but also images, CSS, JS, and fonts. A browser can also run JavaScript, which can gather further information about the browser (version, extensions, data in local storage, etc.) and about the user (e.g. how you move the mouse). And a real human loads/visits pages with random delays and in a fairly random order. All of these elements can be used to detect a script. Servers may use very complex systems, even machine learning, and compare your behavior over a few minutes or hours.
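To give one concrete (and deliberately incomplete) example: sending browser-like headers with requests removes the most obvious giveaway, the default python-requests User-Agent, but it will not fool systems that also check JavaScript execution, cookies, and timing. The header values below are illustrative only:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
                  "(KHTML, like Gecko) Chrome/120.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-CA,en;q=0.9",
    "Referer": "https://www.google.com/",
}
resp = requests.get("https://www.walmart.ca/", headers=headers)
print(resp.status_code)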
I have a Node.js server set up on AWS with MongoDB. I want to access the database contents using the GET method. There is another application, written in Python, which needs to access this database on AWS. I searched the internet and came across PycURL, but I am not sure how to use it exactly. How should I approach this with PycURL, or what would be an alternate solution?
You can build a RESTful API that handles those GET requests. There is a good tutorial (with the example you want at the bottom):
https://scotch.io/tutorials/build-a-restful-api-using-node-and-express-4
Edit: If you want Python code for GET requests, there is a good answer here: Simple URL GET/POST function in Python
Edit 2: Here is an example of how this would work. First you need to code your API to handle GET requests on a given route (example: http://localhost:5000/api/getUsers). Then you make a GET request to that route using Python:
Example:
import requests

r = requests.get(url="http://localhost:5000/api/getUsers")
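Assuming the route returns a JSON array, a slightly fuller sketch with error handling would be (the response shape is an assumption; it depends on how you build the Express route):

import requests

resp = requests.get("http://localhost:5000/api/getUsers", timeout=10)
resp.raise_for_status()   # fail fast on non-2xx responses
users = resp.json()       # parse the JSON body returned by the route
print("fetched", len(users), "users")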
I had a similar problem a while ago; there is a tutorial here. It can lead you in your intended direction. The drawback may be that in the tutorial, to issue the HTTP request (if I remember correctly), they used Postman, but I'm sure you can still use PycURL.
We are seeing very poor performance while using MITMProxy in Python. We build and forward custom requests using the requests Python library.
Our program uses the script mode on MITMProxy to create a custom request based on the request from a client and then return the response. So, basically, for every request made to the proxy, a new request object is built with requests, then forwarded and then returned.
How can I increase the performance of MITMProxy when using it to forward requests?
I fixed this issue with Juan a while ago, but having received similar questions lately, let me leave the solution here for reference:
mitmproxy processes flows in a single thread, so while an inline script is handling one flow, other requests block. Scripts can be run threaded by using the libmproxy.script.concurrent decorator. For more details, have a look at the docs.
(Full disclosure: I authored this feature)
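For reference, a threaded handler looks roughly like this; the import path differs between versions (older releases exposed the decorator as libmproxy.script.concurrent, newer ones as mitmproxy.script.concurrent), so treat it as a sketch rather than something to copy verbatim:

from mitmproxy.script import concurrent
import requests

@concurrent  # run this hook in its own thread so other flows are not blocked
def request(flow):
    # Simplified: forward the intercepted request ourselves (GET only, no body).
    upstream = requests.get(flow.request.pretty_url,
                            headers=dict(flow.request.headers),
                            timeout=10)
    # ...build the response returned to the client from `upstream` here...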
When doing a scrape of a site, which would be preferable: using curl, or using Python's requests library?
I originally planned to use requests and explicitly specify a user agent. However, when I use this I often get an "HTTP 429 too many requests" error, whereas with curl, it seems to avoid that.
I need to update metadata information on 10,000 titles, and I need a way to pull down the information for each of the titles in a parallelized fashion.
What are the pros and cons of using each for pulling down information?
Since you want to parallelize the requests, you should use requests with grequests (if you're using gevent) or erequests (if you're using eventlet). You may have to throttle how quickly you hit the website, though, since they may do some rate limiting and refuse you for requesting too much in too short a period of time.
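A minimal grequests sketch (the URL pattern and the size=10 concurrency cap are illustrative, not values from your setup):

import grequests

urls = ["https://api.example.com/titles/%d" % i for i in range(10000)]
pending = (grequests.get(u, timeout=10) for u in urls)
responses = grequests.map(pending, size=10)   # at most 10 requests in flight at once

ok = [r for r in responses if r is not None and r.status_code == 200]
print(len(ok), "of", len(urls), "titles fetched")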
Using requests would allow you to do it programmatically, which should result in a cleaner product.
If you use curl, you're making os.system (or subprocess) calls to an external program, which is slower.
I'd go for the in-language version over an external program any day, because it's less hassle.
Only if that turned out to be unworkable would I fall back to an external program. Always consider that people's time is infinitely more valuable than machine time. Any "performance gains" in such an application will probably be swamped by network delays anyway.