I need to download ~50 CSV files in Python. According to Chrome's network tab, the download itself takes only about 0.1 seconds, while the server takes about 7 seconds to process each request.
I am currently using headless Chrome to make the requests.
I tried multithreading, but from what I can tell the browser doesn't support it (it can't start another request before the first one finishes processing). I don't think multiprocessing is an option either, as this script will be hosted on a virtual server.
My next idea is to use the requests module instead of headless Chrome, but I am having issues connecting to the company network without a browser. Will this work, though? Any other solutions? Could I do something with multiple driver instances, or multiple tabs on a single driver? Thanks!
Here's my code:
from multiprocessing.pool import ThreadPool
from selenium import webdriver

driver = webdriver.Chrome()
Login(driver)  # user-defined login helper

def getFile(item):
    driver.get(url.format(item))

updateSet = blah  # placeholder for the items to download

pool = ThreadPool(len(updateSet))
for item in updateSet:
    pool.apply_async(getFile, (item,))
pool.close()
pool.join()
For requests, try setting the User-Agent string to a real browser such as Chrome, e.g. Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36.
Some example code:
import requests

url = 'SOME URL'
headers = {
    'User-Agent': 'user agent here',
    'From': 'youremail@domain.com'  # this is another valid header field
}
response = requests.get(url, headers=headers)
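If the login itself only works through a browser, one possible pattern (a sketch, not something from the original posts) is to sign in once with Selenium, copy the session cookies into a requests.Session, and then run the downloads through the same ThreadPool as in the question. url, updateSet and Login() below are the question's own names; everything else is an assumption:
from multiprocessing.pool import ThreadPool
import requests
from selenium import webdriver

driver = webdriver.Chrome()
Login(driver)  # the question's own login helper

# Copy the authenticated browser cookies into a plain requests session
session = requests.Session()
for cookie in driver.get_cookies():
    session.cookies.set(cookie['name'], cookie['value'])

def getFile(item):
    # requests releases the GIL while waiting on the network, so threads
    # can overlap the ~7 s of server-side processing per request
    r = session.get(url.format(item), timeout=60)
    with open('{}.csv'.format(item), 'wb') as f:
        f.write(r.content)

pool = ThreadPool(10)          # 10 concurrent downloads; tune for the server
pool.map(getFile, updateSet)   # updateSet as in the question
pool.close()
pool.join()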
I am using the Python requests module to send some requests to Google, but after a number of requests a reCAPTCHA pops up. I am setting a user agent, but it still appears!
What should I do?
Setting the user agent did change how the browser is identified, but it had no effect on the captcha problem.
from time import sleep
import requests

user_agent = 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'
sleep(2)
headers = {'User-Agent': user_agent}
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
file = requests.get(f'https://www.google.com/search?q=contact+email+{keyword}+site:{site}&num=100', headers=headers)
I used sleep but in vain. Any suggestions?
That's kind of the entire point of captchas: they help deter bots and spammers. Most captchas can't be bypassed easily, so just changing the user agent won't make the captcha go away. Since it sounds like the captchas only appear after a certain number of requests, you could use rotating residential proxies and change the session's IP address whenever a captcha is detected.
Alternatively, you can use a captcha solving service like Anti-Captcha or DeathByCaptcha which involves parsing information about the captcha and then sending it to a service that has workers manually complete it for you. It’s not exactly convenient or efficient, though, and it can often take up to ~30 seconds for a worker to complete a single captcha. Both options cost money.
import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.102 Safari/537.36'}
r = requests.get(url_, headers=headers)
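Beyond the headers, here is a rough sketch of the rotating-proxy idea, assuming you have a list of proxy URLs from a provider and that the captcha page can be recognised by the word 'captcha' in the response body (both assumptions, not tested against Google):
import random
import requests

# Hypothetical rotating residential proxies; replace with your provider's endpoints
PROXIES = ['http://user:pass@proxy1.example.com:8000',
           'http://user:pass@proxy2.example.com:8000']
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/107.0.0.0 Safari/537.36'}

def fetch(url, max_attempts=5):
    for attempt in range(max_attempts):
        proxy = random.choice(PROXIES)
        try:
            r = requests.get(url, headers=headers,
                             proxies={'http': proxy, 'https': proxy},
                             timeout=15)
        except requests.RequestException:
            continue  # dead proxy, pick another one
        if 'captcha' not in r.text.lower():  # crude captcha check (assumption)
            return r
    raise RuntimeError('every attempt hit a captcha or failed')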
I am trying to scrape this website "https://allegro.pl/uzytkownik/feni44/lampy-przednie-i-elementy-swiatla-do-jazdy-dziennej-drl-255102?bmatch=cl-e2101-d3793-c3792-fd-60-aut-1-3-0412"
Everything was working fine until yesterday, but suddenly I started getting a 403 error.
I have tried proxies and a VPN, but the error persists.
When scraping a website, you must be careful of its anti-DDoS protection. One form of DDoS is submitting many requests at once (for example by refreshing repeatedly), which increases the server's load and hurts its performance. A web scraper does exactly that as it works through each link, so the website can mistake your bot for a DDoS attack and block its IP address, returning 403 FORBIDDEN for any access from that address.
Usually the block is only temporary, so after 12 or 24 hours (or however long the site's block period is) you should be good to go. If you want to avoid future 403 FORBIDDEN errors, consider sleeping for about 10 seconds between requests.
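A minimal sketch of that throttling idea: pause roughly 10 seconds between requests and back off for longer if a 403 does come back (the exact delays are assumptions to tune for the site):
import time
import requests

headers = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36'}

def polite_get(url, delay=10, backoff=300, retries=3):
    for attempt in range(retries):
        r = requests.get(url, headers=headers, timeout=30)
        if r.status_code != 403:
            time.sleep(delay)                 # pause between successful requests
            return r
        time.sleep(backoff * (attempt + 1))   # blocked: wait longer before retrying
    r.raise_for_status()                      # still 403 after all retries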
Try a proxy service like Bright proxy; they have more than 72 million proxies. I think the issue will be resolved by rotating the proxy and the user agent.
How can a server detect a bot from a single HTML request identical to one made from an interactive session? For example, I can open a new private browser in Firefox, enter a URL and have everything come back 200. However, when I copy the initial HTML request that loaded the page -- url, headers and all -- and make it using a scripted tool like requests_html on the same device, I get a 403. What other information is the server using to differentiate between these two requests? Is there something that Firefox or requests_html are doing that is not visible from the developer tools and python code?
Sample code (domain substituted):
from requests_html import HTMLSession

url = 'https://www.example.com'
headers = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Accept-Language': 'en-US,en;q=0.5',
    'Connection': 'keep-alive',
    'DNT': '1',
    'Host': 'www.example.com',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/603.3.8 (KHTML, like Gecko) Version/10.1.2 Safari/603.3.8'
}
session = HTMLSession()
response = session.get(url, headers=headers)
I would really recommend the selenium package. requests struggles with dynamically loaded and asynchronously rendered content; it's great for interacting with APIs, but for scraping a site like this, selenium is the tool.
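For reference, a minimal headless fetch with selenium (assuming Chrome and chromedriver are installed) that you could compare against the requests_html result:
from selenium import webdriver

# Headless Chrome fetch of the same page, for comparison with the scripted request
options = webdriver.ChromeOptions()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)
driver.get('https://www.example.com')
html = driver.page_source
driver.quit()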
requests_html only spins up a headless Chromium browser when you call render(); a plain session.get() is an ordinary requests call. Either way, if the request you send really is identical, the server should not be able to tell the difference.
Normally an HTTP request contains only the protocol version, the method, the headers and the body.
So if both are truly identical, it is strange that the web server can tell them apart.
Servers might look at timing, but I assume this is the very first request and that you tried both from the same IP address.
Servers might also flag a request that is 100% identical to one performed earlier, but I assume you already ruled that out by first trying with your script and then with your private browser.
I also assume you checked in your browser that no redirects were involved.
Another difference can show up during the SSL/TLS negotiation (for example the order of cipher suites offered and accepted), which some servers use to fingerprint clients.
It might also be that your browser fetches 'favicon.ico' before the page and requests_html does not.
My suggestion is to first make sure you can reproduce a browser request exactly with requests_html.
I'd suggest the following: set up your own web server on your local machine, in a virtual machine, or in a container on one of your remote servers.
Configure nginx to log at debug level.
Then access it once with your private browser and once with your requests_html script, and go through the generated log looking for any differences.
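If setting up nginx feels like too much, a quicker variant of the same experiment is a tiny local Python server that just prints whatever headers arrive, so you can hit it once from Firefox and once from the script and diff the output (this only shows HTTP-level differences, not the TLS handshake):
from http.server import BaseHTTPRequestHandler, HTTPServer

class DumpHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Print the request line and every header exactly as received
        print('---', self.requestline)
        for name, value in self.headers.items():
            print('{}: {}'.format(name, value))
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.end_headers()
        self.wfile.write(b'ok')

HTTPServer(('127.0.0.1', 8000), DumpHandler).serve_forever()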
I've recently been working on occasional data-intensive projects, and I need to gather data from e-commerce platforms like Amazon, so I created a web scraping program in Python. I'm using the requests library along with a list of user agents and proxies, but I don't think the proxies are being applied, and that is causing the program to fail. Note that the Amazon API is too limited in terms of content and access rates for my needs.
Here's how I send requests:
import requests
import random

session = requests.session()
proxies = [{'https:': 'https://' + item.rstrip(),
            'http': 'http://' + item.rstrip()}
           for item in open('proxies.txt').readlines()]
user_agent = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) '
                            'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
print(session.get('https://icanhazip.com', proxies=random.choice(proxies), headers=user_agent).text)
However, I keep getting the same IP address printed, which means the proxies are not being used this way. proxies.txt contains proxies in this format:
178.168.19.139:30736
342.552.34.456:8080
...
What is the best way to work around the captchas and robot checks presented by Amazon using these tools (or additional tools, if you have suggestions), and why are the proxies failing to work?
I'm not sure if this will work for you, but I found that removing the protocol prefix from the IP inside the dictionary (and the stray colon in the 'https:' key) solved the problem.
proxies = [{'https': item.rstrip(), 'http': item.rstrip()} for item in open('proxies.txt').readlines()]
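To confirm the proxy is actually being applied, a quick sanity check (reusing the fixed proxies list above and the user_agent dict from the question) is to compare the IP that icanhazip reports with and without the proxy:
import random
import requests

direct_ip = requests.get('https://icanhazip.com').text.strip()
proxied_ip = requests.get('https://icanhazip.com',
                          proxies=random.choice(proxies),   # proxies list as above
                          headers=user_agent).text.strip()
print('direct:', direct_ip, '| via proxy:', proxied_ip)
# If the two values match, the proxy entry is being ignored or is not anonymous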
I'm trying to scrape the Home Depot website using Python and requests. Selenium Webdriver works fine, but takes way too much time, as the goal is to make a time-sensitive price comparison tool between local paint shops and power tool shops.
When I send a request to any other website, it works like normal. If I navigate to the website manually in any browser, it also works fine (with or without session/cookie data). I tried adding randomized headers to the request, but it does not seem to help. From what I can see, it's not an issue of sending too many requests per time period (Selenium and manual browsing still work at any time), so I am confident this specific issue is NOT because of rate limiting.
my code:
from random import choice
import requests
import traceback

list_desktopagents = ['Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36']

def random_headers():
    return {'User-Agent': choice(list_desktopagents),
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8'}

response = requests.get(
    'https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629',
    headers=random_headers(),
    timeout=10)
my error:
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='www.homedepot.com', port=443): Read timed out. (read timeout=10)
Does anyone have a suggestion on what else I could do to successfully receive a response? I would prefer to use requests, but anything that runs faster than Selenium would be suitable. I understand that I'm being blocked; my question is not so much 'what's happening to stop me from scraping?' but rather 'what can I do to further humanize my scraper so it lets me continue?'
The error is coming from the user agent. The reason Selenium works and requests does not is that Selenium drives a real browser, so it looks more humanlike, while plain requests is much easier to detect as a script. Note that from Home Depot's robots.txt it doesn't look like product pages are allowed to be scraped. I got a response using this code:
headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
response = requests.get('https://www.homedepot.com/p/BEHR-1-gal-White-Alkyd-Semi-Gloss-Enamel-Alkyd-Interior-Exterior-Paint-390001/300831629', headers=headers)
print(response.content)
By using these user agents you can "trick" the site into thinking you are an actual person, which is what the web driver with Selenium does.