Fastest way to interact with a website in Python

I know we can use Selenium with ChromeDriver as a high-level website interaction tool, and we can speed it up with PhantomJS. Then we can use the requests module to be even faster. But that's where my knowledge stops.
What's the fastest possible way in Python to do POST and GET requests? I assume there is a lower-level library than requests? Do we use sockets and packets?
'To execute the request as fast as possible'
If Python's requests lib is the fastest, are there libs in other programming languages such as C++ that are worth a look?
It really depends on the task: for web scraping 1000 pages it's fine, but when you need to requests.post 1,000,000+ times it adds up. I've also looked into the multiprocessing lib. It helps a lot by using all the computational resources I have, but traversing the network and waiting for the response is what takes the longest. I would have thought the best way to increase speed is to send and receive less data: say, receive only 5 input parameters and send only those 5 back as a POST, then wait for a 200 response. Any ideas how I can do this without receiving all the source code?
Thanks!
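One direction that often helps here, as a rough sketch assuming aiohttp is acceptable: issue the POSTs concurrently from a single keep-alive session and read only the status code, not the body. The endpoint URL and payloads below are placeholders, not taken from the question.

    # Sketch: concurrent POSTs with aiohttp, checking only the status code.
    import asyncio
    import aiohttp

    URL = "https://example.com/endpoint"          # hypothetical endpoint
    PAYLOADS = [{"id": i} for i in range(1000)]   # hypothetical data

    async def post_one(session, payload):
        # Checking resp.status without reading the body avoids downloading
        # the full response.
        async with session.post(URL, data=payload) as resp:
            return resp.status

    async def main():
        # One session reuses TCP connections (keep-alive), saving the
        # connection setup cost on every request; limit caps concurrency.
        connector = aiohttp.TCPConnector(limit=100)
        async with aiohttp.ClientSession(connector=connector) as session:
            statuses = await asyncio.gather(*(post_one(session, p) for p in PAYLOADS))
        print(sum(1 for s in statuses if s == 200), "requests returned 200")

    asyncio.run(main())

Raw sockets won't make a single request much faster; most of the time goes to network round trips, so overlapping many requests is usually the bigger win.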

Related

How does a server detect whether some sort of automated HTTP request is being made

I was making some test http requests using Python's request library. When searching for Walmart's Canadian site (www.walmart.ca), I got this:
How do servers like Walmart's detect that my request is being made programmatically? I understand browsers send all sorts of metadata to the server. I was hoping to get a few specific examples of how this is commonly done. I've found a similar question, albeit related to Selenium WebDriver, here, where it claims that there are some vendors that provide this service, but I was hoping to get something a bit more specific.
Appreciate any insights, thanks.
As mentioned in the comments, a real browser sends many different values - headers, cookies, data. It reads from the server not only HTML but also images, CSS, JS, and fonts. A browser can also run JavaScript, which can gather other information about the browser - version, extensions, data in local storage, etc. (and even how you move the mouse). And a real human loads/visits pages with random delays and in a rather random order. All of these elements can be used to detect a script. Servers may use very complex systems, even Machine Learning (Artificial Intelligence), and use data from a few minutes or hours to compare your behavior.
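As a rough illustration of only the header/cookie part (this alone will not defeat the more advanced checks described above), a script can reuse a session and send browser-like headers; the header values here are just examples.

    # Sketch: browser-like headers and persistent cookies with requests.
    import requests

    session = requests.Session()  # keeps cookies between requests, like a browser
    session.headers.update({
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",  # example value
        "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
        "Accept-Language": "en-US,en;q=0.9",
    })

    resp = session.get("https://www.walmart.ca")  # URL from the question
    print(resp.status_code)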

While querying data (web scraping) from a website with Python, how do I avoid being blocked by the server?

I was trying to use Python requests and mechanize to gather information from a website. This process requires me to post some information and then get the results from that website. I automate this process using a for loop in Python. However, after ~500 queries, I was told that I am blocked due to a high query rate. Each query takes about 1 second. I was using some software online that queries multiple records without problems. Could anyone help me avoid this issue? Thanks!
No idea how to solve this. I am looping this process (by automatically changing the case number) and exporting the data to CSV. After some queries, I was told that my IP was blocked.
Optimum randomized delay time between requests.
Randomized real user-agents for each request.
Enabling cookies.
Using a working proxy pool and selecting a random proxy for each request (see the sketch below).
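A minimal sketch combining those four points with requests; the user-agent strings, proxy addresses, query URL, and delay range are placeholders to tune.

    # Sketch: randomized delays, rotated user agents, cookies, random proxy.
    import random
    import time
    import requests

    USER_AGENTS = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",        # example values
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    ]
    PROXIES = ["http://proxy1:8080", "http://proxy2:8080"]  # hypothetical pool
    CASE_NUMBERS = range(1, 501)                            # stands in for the looped case numbers

    session = requests.Session()  # cookies persist across requests

    for case in CASE_NUMBERS:
        proxy = random.choice(PROXIES)
        resp = session.post(
            "https://example.com/query",                    # placeholder URL
            data={"case_number": case},
            headers={"User-Agent": random.choice(USER_AGENTS)},
            proxies={"http": proxy, "https": proxy},
        )
        # ... parse resp and append a row to the CSV here ...
        time.sleep(random.uniform(1.0, 3.0))  # randomized delay between queries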

Multi threading script for HTTP status codes

Hi Stackoverflow community,
I would like to create a script that uses multithreading to create a high number of parallel requests for HTTP status codes on a large list of URLs (more than 30k vhosts).
The requests can be executed from the same server where the websites are hosted.
I was using multithreaded curl requests, but I'm not really satisfied with the results I've got. For a complete check of 30k hosts it takes more than an hour.
I am wondering if anyone has any tips or is there a more performant way to do it?
After testing some of the available solutions, the simplest and fastest way was using webchk.
webchk is a command-line tool developed in Python 3 for checking the HTTP status codes and response headers of URLs.
The speed was impressive, the output was clean, and it parsed 30k vhosts in about 2 minutes.
https://webchk.readthedocs.io/en/latest/index.html
https://pypi.org/project/webchk/
If you're looking for parallelism and multi-threaded approaches to make HTTP requests with Python, then you might start with the aiohttp library, or use the popular requests package. Multithreading can be accomplished with the threading or concurrent.futures modules from the standard library (multiprocessing runs separate processes, which is usually more than these I/O-bound requests need).
Here's a discussion of rate limiting with aiohttp client: aiohttp: rate limiting parallel requests
Here's a discussion about making multiprocessing work with requests https://stackoverflow.com/a/27547938/10553976
Making it performant is a matter of your implementation. Be sure to profile your attempts and compare to your current implementation.
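For comparison, a minimal threaded sketch using only requests plus concurrent.futures from the standard library; the URL list, worker count, and timeout are placeholders, and whether HEAD is sufficient depends on the servers being checked.

    # Sketch: multithreaded HTTP status check with a thread pool.
    from concurrent.futures import ThreadPoolExecutor
    import requests

    URLS = ["https://example.com", "https://example.org"]  # stand-in for the 30k vhosts

    def check(url):
        try:
            # HEAD keeps responses small; some servers only answer GET properly.
            resp = requests.head(url, timeout=5, allow_redirects=True)
            return url, resp.status_code
        except requests.RequestException as exc:
            return url, str(exc)

    with ThreadPoolExecutor(max_workers=50) as pool:
        for url, status in pool.map(check, URLS):
            print(url, status)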

Using curl vs Python requests

When doing a scrape of a site, which would be preferable: using curl, or using Python's requests library?
I originally planned to use requests and explicitly specify a user agent. However, when I use this I often get an "HTTP 429 too many requests" error, whereas with curl, it seems to avoid that.
I need to update metadata information on 10,000 titles, and I need a way to pull down the information for each of the titles in a parallelized fashion.
What are the pros and cons of using each for pulling down information?
Since you want to parallelize the requests, you should use requests with grequests (if you're using gevent) or erequests (if you're using eventlet). You may have to throttle how quickly you hit the website, though, since they may do some rate limiting and refuse you for requesting too much in too short a period of time.
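A short sketch of what the grequests approach can look like; the URL list, header, and size value are placeholders to adjust.

    # Sketch: parallel GETs with grequests (gevent-based).
    import grequests  # monkey-patches the stdlib via gevent, so import it before other network code

    urls = ["https://example.com/title/%d" % i for i in range(100)]  # hypothetical title URLs
    reqs = (grequests.get(u, headers={"User-Agent": "my-scraper/1.0"}) for u in urls)

    # size caps how many requests run at once, which helps avoid HTTP 429.
    for resp in grequests.map(reqs, size=10):
        if resp is not None:
            print(resp.url, resp.status_code)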
Using requests would allow you to do it programmatically, which should result in a cleaner product.
If you use curl, you're doing os.system calls which are slower.
I'd go for the in-language version over an external program any day, because it's less hassle.
Only if it turns out unworkable would I fall back to this. Always consider that people's time is infinitely more valuable than machine time. Any "performance gains" in such an application will probably be swamped by network delays anyway.

Streaming the result of a command back to the browser using Twisted and Comet

I'm writing an application that streams the output (by this I mean both sys.stdout and sys.stderr) of a Python script executed on the server, in real time, to the browser.
The users on the site will be allowed to select the script to run, execute and kill their chosen script, and change some parameters, so I will need a different thread per user on the site (user A can start, stop and change a script, whilst user B can do the same with a different script).
I know I need to use comet for the web clients, and seeing as the rest of the project is written in python, I'd like to use twisted for the server, however I'm not really sure of what I need to do next!
There are a daunting number of options (Divmod Mantissa, Divmod Nevow, twisted.web, STOMP, etc.), and some are better documented than others, making the whole thing rather tricky!
I have a working demo using stompservice on Orbited, using Orbited.TCPSocket for the JavaScript side of things, however I'm starting to think that STOMP's channel model isn't going to work for multiple threads and multiple running scripts (unless I open a new channel per run, but that seems like the wrong use of the channel model).
Can anyone point me in the right direction, or some sample code I can learn from?
Thanks!
Nevow Athena is a framework specifically for AJAX and COMET applications and in theory is exactly the sort of thing you are looking for.
However, I am not sure that it is well used or supported at this time - looking at mailing list traffic and google search results suggests that it may not be.
There are a couple of tutorials you could look at to help you decide on it:
one on the 'official' site: http://divmod.org/trac/wiki/DivmodNevow/Athena/Tutorials/LiveElement
and one other that I found:
http://divmodsphinx.funsize.net/nevow/chattutorial/part01/index.html
The code for the latter seems to be included in the Nevow distribution when you download it under /doc/listings/partxx (I think...)
You can implement a very simple "HTTP streaming" by keeping the HTTP connection open and appending JavaScript chunks that update the DOM contents. This works since the browser evaluates the "script" chunks as they arrive.
I wrote a blog entry a while ago with a running example using twisted and very few lines of javascript: Simple HTTP streaming with Twisted & Javascript
You can easily mix this pattern with a publisher/subscriber pattern to make it multiuser, etc. I use this pattern to watch live log streams via web.
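A bare-bones sketch of that pattern with twisted.web (not the code from the linked blog post): the request is left open and script chunks are written to it on a timer, so the browser updates the page as each chunk arrives.

    # Sketch: simple HTTP streaming with Twisted; names here are illustrative.
    from twisted.internet import reactor, task
    from twisted.web import server, resource

    class StreamingLog(resource.Resource):
        isLeaf = True

        def render_GET(self, request):
            request.setHeader(b"content-type", b"text/html")
            request.write(b"<html><body><div id='log'></div>\n")

            counter = {"n": 0}

            def push_line():
                counter["n"] += 1
                # Each chunk is a <script> the browser runs as soon as it arrives.
                chunk = (
                    "<script>document.getElementById('log').innerHTML"
                    " += 'line %d<br>';</script>\n" % counter["n"]
                )
                request.write(chunk.encode("ascii"))

            loop = task.LoopingCall(push_line)
            loop.start(1.0)
            # Stop pushing when the browser disconnects.
            request.notifyFinish().addErrback(lambda _: loop.stop())
            return server.NOT_DONE_YET

    reactor.listenTCP(8080, server.Site(StreamingLog()))
    reactor.run()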
An example of serving for long-polling clients with Twisted is slosh. This might not be what you want, but because it's not a large framework, it can help you figure out how to use Twisted.
