Using curl vs Python requests

When scraping a site, which is preferable: curl, or Python's requests library?
I originally planned to use requests and explicitly specify a user agent. However, when I do this I often get an "HTTP 429 Too Many Requests" error, whereas curl seems to avoid that.
I need to update metadata information on 10,000 titles, and I need a way to pull down the information for each of the titles in a parallelized fashion.
What are the pros and cons of using each for pulling down information?

Since you want to parallelize the requests, you should use requests together with grequests (if you're using gevent) or erequests (if you're using eventlet). You may have to throttle how quickly you hit the website, though, since it may be rate limiting you and refusing requests when too many arrive in too short a period of time.
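For illustration, here's a minimal grequests sketch (the URL pattern and User-Agent string below are made up); the size argument caps how many requests run at once, which is a crude way to stay under a rate limit:

```python
import grequests  # gevent-based; use erequests instead if you're on eventlet

# Hypothetical list of title URLs whose metadata needs updating.
urls = ["https://example.com/title/%d" % i for i in range(10000)]

reqs = (grequests.get(u, headers={"User-Agent": "my-metadata-bot/1.0"}) for u in urls)

# size limits concurrency; failed requests come back as None.
for resp in grequests.map(reqs, size=10):
    if resp is not None:
        print(resp.url, resp.status_code)
```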

Using requests lets you do everything programmatically, which should result in a cleaner product.
If you use curl, you're making os.system (or subprocess) calls, which are slower because each one spawns a separate process.
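A rough sketch of the difference, with a hypothetical URL:

```python
import subprocess
import requests

url = "https://example.com/title/123"  # placeholder

# Shelling out to curl spawns a new process for every call and hands back
# raw text that you still have to parse yourself.
body = subprocess.run(["curl", "-s", url], capture_output=True, text=True).stdout

# Staying in-process with requests avoids the per-call process overhead and
# exposes status codes, headers, and JSON decoding directly.
resp = requests.get(url)
data = resp.json() if resp.ok else None
```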

I'd go for the in-language version over an external program any day, because it's less hassle.
Only if that turned out to be unworkable would I fall back to shelling out to curl. Always consider that people's time is infinitely more valuable than machine time; any "performance gains" in such an application will probably be swamped by network delays anyway.

Related

How does a server detect whether an automated HTTP request is being made

I was making some test HTTP requests using Python's requests library. When requesting Walmart's Canadian site (www.walmart.ca), I got this:
How do servers like Walmart's detect that my request is being made programmatically? I understand browsers send all sorts of metadata to the server. I was hoping to get a few specific examples of how this is commonly done. I've found a similar question here, albeit related to Selenium WebDriver, where it's claimed that some vendors provide this as a service, but I was hoping for something a bit more specific.
Appreciate any insights, thanks.
As mentioned in the comments, a real browser sends many different values: headers, cookies, and data. It fetches not only the HTML from the server but also images, CSS, JavaScript, and fonts. The browser can also run JavaScript, which can collect further information about the client: version, extensions, data in local storage, even how you move the mouse. A real human also loads pages with random delays and in a fairly random order. All of these signals can be used to detect a script. Servers may use very complex systems, even machine learning, and compare your behaviour against data gathered over a few minutes or hours.
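As a small illustration of the header part of this (it does nothing about JavaScript or behavioural fingerprinting), compare a default requests call with one that mimics a few browser headers; the values below are examples only, not a guaranteed way past any particular detection system:

```python
import requests

# A bare script request sends only a handful of headers, with a User-Agent
# like "python-requests/2.x", which is trivial to flag on the server side.
r = requests.get("https://example.com")

# A few of the headers a real browser would send (illustrative values).
browser_like_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.5",
    "Accept-Encoding": "gzip, deflate, br",
    "Connection": "keep-alive",
}
r = requests.get("https://example.com", headers=browser_like_headers)
```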

Multi threading script for HTTP status codes

Hi Stack Overflow community,
I would like to create a script that uses multithreading to run a high number of parallel requests for HTTP status codes against a large list of URLs (more than 30k vhosts).
The requests can be executed from the same server where the websites are hosted.
I was using multithreaded curl requests, but I'm not really satisfied with the results I got. A complete check of 30k hosts takes more than an hour.
I am wondering if anyone has any tips, or whether there is a more performant way to do it?
After testing some of the available solutions, the simplest and fastest approach was to use webchk.
webchk is a command-line tool developed in Python 3 for checking the HTTP status codes and response headers of URLs.
The speed was impressive and the output was clean; it parsed 30k vhosts in about 2 minutes.
https://webchk.readthedocs.io/en/latest/index.html
https://pypi.org/project/webchk/
If you're looking for parallel, multithreaded approaches to making HTTP requests with Python, you might start with the aiohttp library or the popular requests package. Parallel execution is also available in the standard library via concurrent.futures or multiprocessing.
Here's a discussion of rate limiting with aiohttp client: aiohttp: rate limiting parallel requests
Here's a discussion about making multiprocessing work with requests: https://stackoverflow.com/a/27547938/10553976
Making it performant is a matter of your implementation. Be sure to profile your attempts and compare to your current implementation.
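For example, here is a minimal aiohttp sketch of a concurrent status-code checker; the URL list, the concurrency limit of 100, and the 10-second timeout are placeholders to tune for your setup:

```python
import asyncio
import aiohttp

URLS = ["https://example.com", "https://example.org"]  # replace with your 30k vhosts

async def check(session, sem, url):
    # The semaphore caps how many requests are in flight at the same time.
    async with sem:
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=10)) as resp:
                return url, resp.status
        except Exception as exc:
            return url, repr(exc)

async def main():
    sem = asyncio.Semaphore(100)
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(check(session, sem, u) for u in URLS))
    for url, status in results:
        print(url, status)

if __name__ == "__main__":
    asyncio.run(main())
```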

How to return different responses to the same multiple requests, based on which responses have already been sent?

Let's say my Python server has three different responses available, and one user sends three HTTP requests at the same time.
How can I make sure that each request gets one unique response out of my three different responses?
I'm using Python and MySQL.
The problem is that even though I store the already-responded status in MySQL, it's a bit too late by the time the next request comes in.
For starters, if MySQL isn't handling your performance requirements (and it arguably shouldn't be asked to; this doesn't sound like a very sane use case for it), consider something like an in-memory cache or, for more flexibility, Redis.
It's built for stuff like this and will likely respond much, much more quickly.
As an added bonus, it's even simpler to work with than SQL.
Second, consider hashing some user and request details and storing that hash with the response to be able to identify it.
Upon receiving a request, store an entry with a 'pending' status, and only handle 'pending' requests - never ones that are missing entirely.
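As a sketch of why this helps with the race condition (assuming redis-py; the key name and response payloads are made up): Redis list operations are atomic, so three simultaneous requests each pop a different element.

```python
import redis

r = redis.Redis()  # assumes a local Redis instance; adjust host/port as needed

# Seed the three prepared responses once, e.g. at startup.
r.delete("pending_responses")
r.rpush("pending_responses", "response-1", "response-2", "response-3")

def handle_request():
    # LPOP is atomic: even if three requests arrive at the same moment,
    # each one receives a different element (or None once the list is empty).
    payload = r.lpop("pending_responses")
    if payload is None:
        return "already handled"
    return payload.decode()
```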

fastest way to interact with a website in python

I know we can use Selenium with ChromeDriver as a high-level website interaction tool, and we can speed it up with PhantomJS. Then we can use the requests module to be even faster. But that's where my knowledge stops.
What's the fastest possible way in Python to make POST and GET requests? I assume there is a lower-level library than requests? Do we use sockets and packets directly?
'To execute the request as fast as possible'
If Python's requests library is the fastest, are there libraries in other programming languages, such as C++, that are worth a look?
It really depends on the task: for web scraping 1,000 pages it's fine, but when you need to requests.post 1,000,000+ times it adds up. I've also looked into the multiprocessing library. It helps a lot by using all the computational resources I have, but traversing the network and waiting for the response is what takes the longest. I would have thought the best way to increase speed is to send and receive less data: say, receive only 5 input parameters, send only those 5 back as a POST, and wait for a 200 response. Any ideas how I can do this without receiving all the source code?
Thanks!
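One way to look at just the status and headers without pulling down the whole body is to stream the response and never read it; this is only a sketch against a made-up endpoint (a HEAD request is another option when the server supports it):

```python
import requests

# With stream=True the body isn't downloaded until you access r.content or
# r.text, so you can check the status code while transferring very little.
r = requests.post("https://example.com/api", data={"field": "value"}, stream=True)
print(r.status_code)                      # e.g. 200
print(r.headers.get("Content-Length"))    # advertised body size, if any
r.close()                                 # close without ever reading the body
```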

API GET requests from Specific IP - Requests Library - Python

I'm looking to switch existing PHP code over to Python using the Requests library. The PHP code sends thousands of GET requests to an API to get needed data. The API limits GET requests to one every 6 seconds per IP. We have numerous IP addresses in order to pull faster. The faster the better in this case.
My question is: is there a way to send the GET requests from different IP addresses using the Requests library? I'm also open to using different Python libraries or different methods to replace the IP addresses.
The current code makes use of curl_multi_exec with the CURLOPT_INTERFACE setting.
As far as code goes, I don't necessarily need code examples. I'm looking more for a direction or option that will allow such features in Python. I would prefer not to post code, but if it's necessary, let me know.
Thanks!
I don't believe Requests supports setting the outbound interface.
There is a Python cURL library (pycurl), though.
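A minimal pycurl sketch of binding the outgoing interface, mirroring what CURLOPT_INTERFACE does in the existing PHP code; the endpoint and IP address are placeholders:

```python
import pycurl
from io import BytesIO

buf = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, "https://api.example.com/endpoint")  # placeholder endpoint
# Bind the outgoing connection to a specific local IP address or interface name.
c.setopt(pycurl.INTERFACE, "203.0.113.7")
c.setopt(pycurl.WRITEDATA, buf)
c.perform()
print(c.getinfo(pycurl.RESPONSE_CODE), len(buf.getvalue()))
c.close()
```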
