I am trying to scrape a website using requests and BeautifulSoup4 in Python. Here is my code:

import re
import requests
import bs4

result = requests.get("https://wolt.com/en/svk/bratislava/restaurant/la-donuteria-bratislava")
soup = bs4.BeautifulSoup(result.content, "html5lib")
for i in soup.find_all("div", {"class": re.compile("MenuItem-module__itemContainer____.*")}):
    print(i.text)
    print()
When I do this with the given URL I get all the results. However, whenever I try to scrape this URL instead:
To be scraped
the result is truncated and I only get 43 results back. Is this a limitation of requests/BS4, or am I doing something else wrong?
Thanks
I think you are getting an error for making too many requests. I guess that if you use this request you will not get banned from the API; use it and tell me!

from urllib.request import Request

req = Request(
    "https://drand.cloudflare.com/public/latest",
    headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:31.0) Gecko/20130401 Firefox/31.0"
    },
)
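A minimal sketch of actually sending that request and reading the response (assuming urllib.request, since that is where this Request signature comes from):

from urllib.request import urlopen

# Send the prepared request and print the response body.
with urlopen(req) as resp:
    print(resp.read().decode("utf-8"))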
Here is my code. It runs, but it prints the DDoS-protection page and not the page that loads after it. I even tried a time.sleep(5) to help with the timing.
How can I get past that?
import requests
from bs4 import BeautifulSoup
import time

url = 'https://www.psacard.com/cert/49628062'
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
soup = BeautifulSoup(requests.get(url, headers=headers).content, 'html.parser')

for a in soup.select('div'):
    print(a)
If any part of the web page is rendered dynamically, for example with JavaScript, BeautifulSoup will not be able to see that content.
Use Selenium for scraping.
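A minimal sketch with Selenium, assuming Chrome and a matching chromedriver are installed (the DDoS-protection interstitial may still block an automated browser, so treat this as a starting point rather than a guaranteed fix):

import time

from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument('--headless')
driver = webdriver.Chrome(options=options)

driver.get('https://www.psacard.com/cert/49628062')
time.sleep(5)  # give the protection page time to pass and the real page time to render

# Hand the rendered HTML to BeautifulSoup exactly as before.
soup = BeautifulSoup(driver.page_source, 'html.parser')
for a in soup.select('div'):
    print(a)

driver.quit()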
I tried to scrape a Japanese website by following a simple tutorial online, but I could not get the information from the website. Below is my code:
import requests
from bs4 import BeautifulSoup

wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
page = requests.get(wiki)
soup = BeautifulSoup(page.text, 'lxml')

for i in soup.findAll('data payments'):
    print(i.text)
What I want to get is in the part below:
<dl class="data payments">
<dt>賃料:</dt>
<dd><span class="num">7.3万円</span></dd>
</dl>
I wish to print out the data payment, which is "賃料", with the price "7.3万円".
Expected(In string):
"payment: 賃料 is 7.3万円"
Edited:

import requests
from bs4 import BeautifulSoup

wiki = "https://www.athome.co.jp/"
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
page = requests.get(wiki, headers=headers)
soup = BeautifulSoup(page.content, 'lxml')
print(soup.decode('utf-8', 'replace'))
In your latest version of the code, you decode the soup, so you will no longer be able to use BeautifulSoup functions like find and find_all on it. But we will come back to that later.
To begin with
After getting the soup, you can print it, and you will see the following (showing only the key part):
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<meta content="0" http-equiv="expires"/>
<meta content="Tue, 01 Jan 1980 1:00:00 GMT" http-equiv="expires"/>
<meta content="10; url=/distil_r_captcha.html?requestId=2ac19293-8282-4602-8bf5-126d194a4827&httpReferrer=%2Fchintai%2F1001303243%2F%3FDOWN%3D2%26BKLISTID%3D002LPC%26sref%3Dlist_simple%26bi%3Dtatemono" http-equiv="refresh"/>
This means that you are not getting the elements you want because you have been detected as a crawler.
Therefore, there is something missing in #KunduK's answer; the problem has nothing to do with the find function yet.
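As a quick check (an illustrative helper, not part of the original code), you can test for that refresh meta tag to see whether you were served the anti-crawler page instead of the real listing:

# Illustrative check: the blocked response redirects to a captcha page via a
# <meta http-equiv="refresh"> tag, so its presence means we were detected.
if soup.find('meta', attrs={'http-equiv': 'refresh'}) is not None:
    print('Blocked: the server returned its anti-crawler captcha page')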
Main Part
First of all, you need to make your Python script look less like a crawler.
Headers
Headers are the most common way to detect a crawler.
With plain requests, when you create a session, you can check its default headers with:
>>> s = requests.session()
>>> print(s.headers)
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
You can see that the headers here tell the server that you are a crawler program, namely python-requests/2.22.0.
Therefore, you need to change the User-Agent by updating the headers.
s = requests.session()
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
s.headers.update(headers)
However, when testing the crawler, it is still detected as a crawler. Therefore, we need to dig further into the headers. (It could also be another reason, such as an IP block or a cookie issue; I will mention those later.)
In Chrome, open the Developer Tools and then open the website. (To pretend this is your first visit to the website, you should clear the cookies first.) After clearing the cookies, refresh the page. In the Network tab of the Developer Tools, you can see a lot of requests made by Chrome.
By clicking the first entry, which is https://www.athome.co.jp/, you can see a detailed panel on the right side, in which the Request Headers are the headers generated by Chrome and used to request the target site's server.
To make sure everything works, you could simply add everything from these Chrome headers to your crawler, and the server can no longer tell whether you are the real Chrome or a crawler. (This works for most sites, but I have also found some sites that use strange settings requiring a special header in every request.)
I have already worked out that after adding accept-language, the website's anti-crawler function will let you pass.
Therefore, putting it all together, you need to update your headers like this:
headers = {
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
s.headers.update(headers)
Cookie
For an explanation of cookies, you can refer to the wiki.
To obtain the cookie, there is an easy way.
First, initialize a session and update its headers, as I mentioned above.
Second, request the page https://www.athome.co.jp; once you get the page, you will obtain a cookie issued by the server.
s.get(url='https://www.athome.co.jp')
The advantage of requests.session is that the session maintains the cookies for you, so your next request will use them automatically.
You can check the cookie you obtained with:
print(s.cookies)
And my result is:
<RequestsCookieJar[Cookie(version=0, name='athome_lab', value='ffba98ff.592d4d027d28b', port=None, port_specified=False, domain='www.athome.co.jp', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=1884177606, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)]>
You do not need to parse this page, because you just want the cookie rather than the content.
To get the content
You can now use the session you obtained to request the page you mentioned.
wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
page = s.get(wiki)
And now, everything you want will be sent to you by the server, and you can parse it with BeautifulSoup.
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
After getting the content you want, you can use BeautifulSoup to get the target element.
soup.find('dl', attrs={'class': 'data payments'})
And what you will get is:
<dl class="data payments">
<dt>賃料:</dt>
<dd><span class="num">7.3万円</span></dd>
</dl>
And you can just extract the information you want from it.
target_content = soup.find('dl', attrs={'class': 'data payments'})
dt = target_content.find('dt').get_text()
dd = target_content.find('dd').get_text()
To format it as a single line:
print('payment: {dt} is {dd}'.format(dt=dt[:-1], dd=dd))
And that is everything.
Summary
I will paste the code below.
# Import the packages you need.
import requests
from bs4 import BeautifulSoup

# Initiate a session and update its headers.
s = requests.session()
headers = {
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
s.headers.update(headers)

# Get the homepage of the website to obtain the cookies.
s.get(url='https://www.athome.co.jp')

"""
# You might need the following line to check whether you have successfully obtained the cookies.
# If not, you might have been blocked by the anti-crawler.
print(s.cookies)
"""

# Get the content from the target page.
wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
page = s.get(wiki)

# Parse the webpage to get the elements.
soup = BeautifulSoup(page.content, 'html.parser')
target_content = soup.find('dl', attrs={'class': 'data payments'})
dt = target_content.find('dt').get_text()
dd = target_content.find('dd').get_text()

# Print the result.
print('payment: {dt} is {dd}'.format(dt=dt[:-1], dd=dd))
In the crawling field, there is a long way to go. You had better get an outline of it and make full use of the Developer Tools in your browser.
You might need to find out whether the content is loaded by JavaScript, or whether the content sits inside an iframe (see the quick check below).
What's more, you might be detected as a crawler and blocked by the server. Anti-anti-crawler techniques can only be learned by writing crawlers more often.
I suggest you start with an easier website, one without an anti-crawler function.
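For example, here is a quick, purely illustrative check on the soup from the summary code above; content inside an iframe is loaded from a separate URL and needs its own request:

# Illustrative check using the soup from the summary code above.
for frame in soup.find_all('iframe'):
    print('iframe:', frame.get('src'))

# If an element is visible in the browser but missing from the soup, it is most
# likely rendered by JavaScript, and requests alone will not see it.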
Try the code below. Use the class name together with the tag to find the element.
from bs4 import BeautifulSoup
import requests

wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
page = requests.get(wiki, headers=headers)
soup = BeautifulSoup(page.content, 'lxml')

for i in soup.find_all("dl", class_="data payments"):
    print(i.find('dt').text)
    print(i.find('span').text)
Output:
賃料:
7.3万円
If you want to format it as your expected output, try this:
from bs4 import BeautifulSoup
import requests

wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
page = requests.get(wiki, headers=headers)
soup = BeautifulSoup(page.content, 'lxml')

for i in soup.find_all("dl", class_="data payments"):
    print("Payment: " + i.find('dt').text.split(':')[0] + " is " + i.find('span').text)
Output:
Payment: 賃料 is 7.3万円
The problem you are having is that the site identifies your requests as coming from a bot and blocks them.
The usual trick to get around that is to attach the same headers (including the cookies) your browser sends with the request. You can see all the headers Chrome is sending if you go to Inspect > Network > select the request > Copy > Copy as cURL.
When you run your script, you get the following:
You reached this page when attempting to access https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono from 152.172.223.133 on 2019-09-18 02:21:34 UTC.
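As a sketch of that approach with requests (all header and cookie values below are placeholders; copy the real ones from your own browser's request):

import requests
from bs4 import BeautifulSoup

url = 'https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono'

# Placeholder values -- replace them with the headers and cookies copied from Chrome.
headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
    'Accept-Language': 'en-US,en;q=0.9',
}
cookies = {'example_cookie_name': 'example_cookie_value'}

page = requests.get(url, headers=headers, cookies=cookies)
soup = BeautifulSoup(page.content, 'html.parser')
print(soup.find('dl', attrs={'class': 'data payments'}))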
I am trying to scrape the following page using Python 3, but I keep getting HTTP Error 400: Bad Request. I have looked at some of the previous answers suggesting urllib.quote, which didn't work for me since it's Python 2. I also tried the following code, as suggested by another post, and it still didn't work.
import urllib.request
from requests.utils import requote_uri

url = requote_uri('http://www.txhighereddata.org/Interactive/CIP/CIPGroup.cfm?GroupCode=01')
with urllib.request.urlopen(url) as response:
    html = response.read()
The server denies queries whose User-Agent HTTP header does not look like a browser's.
Just pick a browser's User-Agent string and set it as a header on your request:
import urllib.request

url = 'http://www.txhighereddata.org/Interactive/CIP/CIPGroup.cfm?GroupCode=01'
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0"
}

request = urllib.request.Request(url, headers=headers)
with urllib.request.urlopen(request) as response:
    html = response.read()
I'm trying to parse the div class titled "dealer-info" from the URL below.
https://www.nissanusa.com/dealer-locator.html
I tried this:
import urllib.request
from bs4 import BeautifulSoup

url = "https://www.nissanusa.com/dealer-locator.html"
text = urllib.request.urlopen(url).read()
soup = BeautifulSoup(text)
data = soup.findAll('div', attrs={'class': 'dealer-info'})

for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
Normally, I would expect that to work, but I'm getting this result: HTTPError: Forbidden
I also tried this:
import urllib.request

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "https://www.nissanusa.com/dealer-locator.html"
headers = {'User-Agent': user_agent}

request = urllib.request.Request(url, None, headers)  # the assembled request
response = urllib.request.urlopen(request)
data = response.read()  # the data you need
print(data)
That gives me all the HTML on the site, but it's pretty ugly to look at and hard to make any sense of.
I'm trying to get a structured data set of the "dealer-info" entries. I am using Python 3.6.
Your first example is probably being rejected by the server because it does not pretend to be an ordinary browser. You should try combining the user-agent code from the second example with the Beautiful Soup code from the first:
import urllib.request
from bs4 import BeautifulSoup

user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "https://www.nissanusa.com/dealer-locator.html"
headers = {'User-Agent': user_agent}

request = urllib.request.Request(url, None, headers)  # the assembled request
response = urllib.request.urlopen(request)
text = response.read()

soup = BeautifulSoup(text, "lxml")
data = soup.findAll('div', attrs={'class': 'dealer-info'})

for div in data:
    links = div.findAll('a')
    for a in links:
        print(a['href'])
Keep in mind that if the web site is explicitly trying to keep Beautiful Soup or other non-recognized user agents out, they may take issue with you scraping their web site data. You should consult and obey https://www.nissanusa.com/robots.txt as well as any terms of use or terms of service agreements you may have agreed to.
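For example, here is a small sketch of checking the site's robots.txt rules before scraping, using Python's standard-library urllib.robotparser:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.nissanusa.com/robots.txt')
rp.read()

# True only if the site's robots.txt allows fetching this path for any user agent.
print(rp.can_fetch('*', 'https://www.nissanusa.com/dealer-locator.html'))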