How to send multiple GET requests to a website in Python?

I'm doing a 3-month internship at a computer science laboratory (LIRIS). My supervisor asked me to retrieve some data from meilleurs-agents.com, a real estate website, and I would like to retrieve the price per square meter for each city. My program is in Python, and I'm trying to send multiple requests to get the data, but it fails with a proxy error:
HTTPConnectionPool(host='XXXXXX', port=XXXX): Max retries exceeded with url: "..." (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000000000B304320>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',)))
A preview of my code:
import requests

headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
})

for city, postal_code in zip(cities, postal_codes):
    url = 'https://www.meilleursagents.com/prix-immobilier/' + city + '-' + postal_code + '/'
    PROXY = {'https': 'XX.XXX.X.XXX:XXXX'}
    try:
        response = requests.get(url, timeout=10, proxies=PROXY)
    except Exception as e:
        print(e)
If I remove the proxy, my request works, but the HTML contains a message like "you seem to be a bot so your request hasn't been completed", so I can't get the prices... But I really need this data.
I hope my problem is clear and that someone can help me :)
Thanks, Nelly
PS : Sorry for my English, I'm a French student :D

Try changing the User-Agent header and cookies for your requests.
Another workaround is to add a delay between requests:
time.sleep(1)  # try different time values
This will of course slow down your script, but it may help you avoid too-many-requests errors.
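For example, a minimal sketch combining both suggestions, reusing the cities and postal_codes variables from the question (the User-Agent strings and the 2-second delay are arbitrary choices, not values the site requires, and the site may still block you if it uses stronger bot detection):

import random
import time
import requests

user_agents = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/90.0 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
]

session = requests.Session()  # a Session keeps cookies between requests
for city, postal_code in zip(cities, postal_codes):
    url = 'https://www.meilleursagents.com/prix-immobilier/' + city + '-' + postal_code + '/'
    headers = {'User-Agent': random.choice(user_agents)}  # rotate the User-Agent
    try:
        response = session.get(url, headers=headers, timeout=10)
    except requests.RequestException as e:
        print(e)
    time.sleep(2)  # pause between requests to look less like a bot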

Related

How can I login to this .aspx login webpage with the help of Python script?

I am working on a project which requires scraping data from this site: https://www.trademap.org/
I need to extract information, such as the companies that import and export various commodities and products, that can only be retrieved after logging in. I am writing a Python script that attempts to log in through this login page: https://idserv.marketanalysis.intracen.org/Account/Login.
However, I am stuck and unable to write a correct script because I am unfamiliar with scraping .aspx web pages. This is my code:
import requests
from bs4 import BeautifulSoup

# start a session
session = requests.Session()

# create the payload
payload = {
    "email": "<my_email_id>",
    "password": "<my_psswd>"
}

url = "https://idserv.marketanalysis.intracen.org/Account/Login?ReturnUrl=%2Fconnect%2Fauthorize%2Fcallback%3Fclient_id%3DTradeMap%26scope%3Dopenid%2520email%2520profile%2520offline_access%2520ActivityLog%26redirect_uri%3Dhttps%253A%252F%252Fwww.trademap.org%252FLoginCallback.aspx%26state%3D094c7f9db5c64cf3874fab75e9411cbf%26response_type%3Dcode%2520id_token%26nonce%3De74b2d9090074249bc8f89f569c5d3a1%26response_mode%3Dform_post"

# post the payload to the login url
headers = {'User-Agent': "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"}
try:
    post = session.post(url, data=payload, headers=headers, verify=False)
    print("Login was successful")
except Exception as e:
    print("Failed to log in:", e)

get_data = session.get("https://www.trademap.org/CompaniesList.aspx?nvpm=1%7c410%7c%7c%7c%7c72%7c%7c%7c2%7c1%7c1%7c2%7c3%7c1%7c2%7c1%7c1%7c4")
soup = BeautifulSoup(get_data.content, 'html.parser')
print(soup)
I am getting this exception:
requests.exceptions.SSLError: HTTPSConnectionPool(host='www.trademap.org', port=443): Max retries exceeded with url: /CompaniesList.aspx?nvpm=1%7C410%7C%7C%7C%7C72%7C%7C%7C2%7C1%7C1%7C2%7C3%7C1%7C2%7C1%7C1%7C4 (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:997)')))
I have tried the beautifulsoup and scrapy libraries to attempt to log in to the website, but every time I get this exception or another kind of error. I am unfamiliar with the ASP.NET framework, APIs and JavaScript.
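A minimal sketch of one way to deal with the CERTIFICATE_VERIFY_FAILED part of the error, assuming the traffic goes through a proxy or firewall that re-signs TLS and that its CA certificate has been exported to a local file (the path below is a placeholder, not a real location):

import requests

session = requests.Session()
# point requests at a CA bundle that also contains the intercepting proxy's certificate;
# the path is hypothetical -- use wherever the exported certificate actually lives
session.verify = "/path/to/corporate-ca-bundle.pem"

response = session.get("https://www.trademap.org/")
print(response.status_code)

The same bundle can also be supplied through the REQUESTS_CA_BUNDLE environment variable instead of setting session.verify in code.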

ProxyError when using Luminati Proxy Manager

I am currently trying to scrape a website using a proxy, and I chose Luminati as the proxy provider.
I created a zone with a Data Center using shared IPs (see the Luminati dashboard screenshot).
Then I installed Luminati Proxy Manager on my local machine and set up a proxy port using the default config.
import requests

# ask the proxy manager (listening locally on port 24000) which exit IP it is using
ip_json = requests.get('http://lumtest.com/myip.json',
                       proxies={"http": "http://localhost:24000/",
                                "https": "http://localhost:24000/"}).json()

proxy = "https://" + ip_json['ip'] + ':' + str(ip_json['asn']['asnum'])
proxies = {"https": proxy, "http": proxy}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get('http://google.com/', headers=headers, proxies=proxies)
However, each time I get
ProxyError: HTTPSConnectionPool(host='x.x.x.x', port=x): Max retries exceeded with url: http://google.com/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x11d3b5358>: Failed to establish a new connection: [Errno 61] Connection refused',)))
I tried URLs with both http and https, but nothing changed. I also tried setting the proxy address as https, but still nothing.
Has anybody encountered this error and resolved it? I would really appreciate your help.
Thank you.
First, Luminati blocks Google by default and only allows specific use cases with the SERP zone.
Second, is Google the only domain you target? Try lumtest.io/myip.json.
The Proxy Manager should show you error codes in the logs; are there any clues there?
Try contacting Luminati live-chat support from the control panel; you may have limitations on your account.
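As a sanity check, a minimal sketch of sending the request straight through the local Proxy Manager port (assuming it listens on localhost:24000 as in the question), rather than rebuilding a proxy URL from the exit IP and ASN:

import requests

# route the request through the local Luminati Proxy Manager port;
# the manager forwards the traffic through whichever zone it is configured for
proxies = {"http": "http://localhost:24000", "https": "http://localhost:24000"}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

response = requests.get('http://lumtest.com/myip.json', headers=headers, proxies=proxies, timeout=10)
print(response.json())

If this works but the target site still fails, the problem is more likely the zone or target restrictions than the local setup.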

Error 10060 using Python requests library [duplicate]

This question already has answers here:
python requests.get always get 404
(3 answers)
Connection Error: A connection attempt failed because the connected party did not properly respond after a period of time
(3 answers)
Closed 4 years ago.
This is not a duplicate. The error message may be the same, but none of the provided solutions worked for my case.
I'm trying to get data from this url using Spyder as IDE.
http://graphs.gw.govt.nz/?siteName=Akatarawa%20River%20at%20Cemetery&dataSource=Rainfall&interval=1%20Day&Alignment=1%20Day
It can be opened in a browser, but when I use the requests.get method, it returns the following error:
HTTPConnectionPool(host='graphs.gw.govt.nz', port=80): Max retries exceeded with url: /?siteName=Akatarawa%20River%20at%20Cemetery&dataSource=Rainfall&interval=1%20Day&Alignment=1%20Day (Caused by NewConnectionError(': Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))
Specifying a user agent and referrer didn't help.
Here's my code:
import requests

URL = 'http://graphs.gw.govt.nz/?siteName=Akatarawa%20River%20at%20Cemetery&dataSource=Rainfall&interval=1%20Day&Alignment=1%20Day'
urlHeaders = {
    "User-Agent": "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2.8) Gecko/20100722 Firefox/3.6.8 GTB7.1 (.NET CLR 3.5.30729)",
    "Referer": "http://example.com",
}
r = requests.get(URL, headers=urlHeaders, timeout=None)
Edit: I've tried running the code in an online Python IDE and it gets a 200 OK, so it may not be an issue with the server.
Thanks!
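A quick diagnostic sketch (not a fix) to check whether the machine can open a plain TCP connection to the host at all; if this also times out, the problem is the local network, firewall, or proxy rather than the Python code:

import socket

try:
    # try to open a raw TCP connection to the web server on port 80
    conn = socket.create_connection(('graphs.gw.govt.nz', 80), timeout=10)
    conn.close()
    print("TCP connection OK - the issue is probably in the HTTP layer")
except OSError as e:
    print("TCP connection failed - likely a network/firewall/proxy issue:", e)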

How to resolve Requests get not working over VPN?

I am trying to scrape a website using requests in python.
url = "https://stackoverflow.com/questions/23013220/max-retries-exceeded-with-url"
# set the headers like we are a browser,
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
# download the homepage
s = requests.Session()
s.trust_env = False
response = s.get(url, headers=headers )
This is working fine when I use my personal wifi. However, when I connect to my company's VPN, I get the following error.
ConnectionError: HTTPSConnectionPool(host='stackoverflow.com', port=443): Max retries exceeded with url: /questions/23013220/max-retries-exceeded-with-url (Caused by NewConnectionError(': Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it',))
Now, I need this to work over my company's VPN because the website I need to access only works on it. How do I resolve this?
In my case, the problem was related to IPv6.
Our VPN used split tunneling, and it seems the VPN configuration does not support IPv6.
So for example this would hang forever:
requests.get('https://pokeapi.co/api/v2/pokemon')
But if you add a timeout, the request succeeds:
requests.get('https://pokeapi.co/api/v2/pokemon', timeout=1)
But not all machines were having this problem. So I compared the output of this among two different machines:
import socket

for line in socket.getaddrinfo('pokeapi.co', 443):
    print(line)
The working one only returned IPv4 addresses. The non-working machine returned both IPv4 and IPv6 addresses.
So with the timeout specified, my theory is that Python fails quickly on IPv6 and then falls back to IPv4, where the request succeeds.
Ultimately we resolved this by disabling IPv6 on the machine:
networksetup -setv6off "Wi-Fi"
But I assume that this could instead be resolved through VPN configuration.
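An alternative that avoids changing the machine's network settings (a sketch only; it assumes a urllib3 version that still exposes allowed_gai_family) is to force the requests/urllib3 stack to resolve IPv4 addresses only for the current process:

import socket
import requests
import urllib3.util.connection as urllib3_connection

def allowed_gai_family():
    # restrict DNS resolution to IPv4 so requests never tries the broken IPv6 route
    return socket.AF_INET

# monkey-patch urllib3 so every connection it opens uses IPv4 only
urllib3_connection.allowed_gai_family = allowed_gai_family

response = requests.get('https://pokeapi.co/api/v2/pokemon', timeout=5)
print(response.status_code)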
How about trying it like this:
from fake_useragent import UserAgent  # assuming UserAgent comes from the fake-useragent package
import requests

url = "https://stackoverflow.com/questions/23013220/max-retries-exceeded-with-url"
ua = UserAgent()
headers = {"User-Agent": ua.random}  # pick a random User-Agent string
# download the homepage
s = requests.Session()
s.trust_env = False
response = s.get(url, headers=headers)
It seems to be caused by a difference in User-Agent settings.
Try setting trust_env to None on the session:
session = requests.Session()
session.trust_env = None  # trust environment settings for proxy configuration, default authentication and similar
Or you can disable proxies for a particular domain:
import os
os.environ['NO_PROXY'] = 'stackoverflow.com'
In my organization, I have to run my program under a VPN for different geo locations, so we have multiple proxy configurations.
I found it simpler to use a package called PyPAC to pick up my proxy details automatically:
from pypac import PACSession
from requests.auth import HTTPProxyAuth
session = PACSession()
# when the username and password is required
# session = PACSession(proxy_auth=HTTPProxyAuth(name, password))
r = session.get('http://example.org')
How does this work: the package locates the PAC file configured by the organization; this file contains the proxy configuration details.

Python scraping script, avoid errors

I have a large script where I scrape tweets from Twitter without the API because it's restricted. The script really does the job, but in some cases it runs into an error like:
URLError: <urlopen error EOF occurred in violation of protocol (_ssl.c:590)>
Or something like:
URLError: <urlopen error [Errno 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond>
Of course, I am using a User-Agent:
request = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"})
response = urllib2.urlopen(request).read()
So the website will think that I am using one of these web browsers. The script really works, does the infinite scroll and everything, but after some number of tweets it runs into an error like these. Maybe Twitter put me on a blacklist or something like that? I changed my IP and the same thing happens. Or could the script change its IP every 5 loops, maybe; is that possible?
I am using these imports in my script:
from bs4 import BeautifulSoup
import json, csv, urllib2, urllib, re
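One way to make the loop more tolerant of these intermittent failures is to retry with a growing delay. A sketch only, reusing the same urllib2 imports and User-Agent as the question (the retry count and sleep values are arbitrary):

import time
import urllib2

def fetch(url, retries=3):
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/50.0.2661.94 Safari/537.36"}
    for attempt in range(retries):
        try:
            request = urllib2.Request(url, headers=headers)
            return urllib2.urlopen(request, timeout=30).read()
        except urllib2.URLError as e:
            print("attempt %d failed: %s" % (attempt + 1, e))
            time.sleep(5 * (attempt + 1))  # back off a little longer each time
    return None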
