I tried to scrape a Japanese website by following a simple online tutorial, but I could not get the information from the page. Below is my code:
import requests
wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
page = requests.get(wiki)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.text, 'lxml')
for i in soup.findAll('data payments'):
    print(i.text)
What I wanted to get is from the below part:
<dl class="data payments">
<dt>賃料:</dt>
<dd><span class="num">7.3万円</span></dd>
</dl>
I want to print the payment label "賃料" together with its price "7.3万円".
Expected (as a string):
"payment: 賃料 is 7.3万円"
Edited:
import requests
wiki = "https://www.athome.co.jp/"
headers = requests.utils.default_headers()
headers.update({
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
page = requests.get(wiki,headers=headers)
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'lxml')
print(soup.decode('utf-8', 'replace'))
In the latest version of your code you decode the soup, which turns it into a plain string, so BeautifulSoup methods such as find_all are no longer available on the result. We will come back to that later.
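To see why decoding matters, here is a minimal sketch that uses an inline HTML snippet instead of the live page (the snippet and variable names are illustrative, not taken from the real response). Calling decode() on the soup gives back a plain str, which has no find_all method, while the soup object itself still does.

```python
from bs4 import BeautifulSoup

# Illustrative HTML; a stand-in for page.content from the real request.
html = '<dl class="data payments"><dt>賃料:</dt><dd><span class="num">7.3万円</span></dd></dl>'

soup = BeautifulSoup(html, 'html.parser')
decoded = soup.decode()  # returns a plain Python string

print(type(decoded).__name__)        # the decoded result is a str
print(hasattr(decoded, 'find_all'))  # a str has no find_all method
print(hasattr(soup, 'find_all'))     # the soup object still has it
```

So keep the soup object for searching and only decode when you need text for printing or saving.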
To begin with
After building the soup, print it and you will see the following (only the key part is shown):
<meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
<meta content="0" http-equiv="expires"/>
<meta content="Tue, 01 Jan 1980 1:00:00 GMT" http-equiv="expires"/>
<meta content="10; url=/distil_r_captcha.html?requestId=2ac19293-8282-4602-8bf5-126d194a4827&httpReferrer=%2Fchintai%2F1001303243%2F%3FDOWN%3D2%26BKLISTID%3D002LPC%26sref%3Dlist_simple%26bi%3Dtatemono" http-equiv="refresh"/>
This means you did not receive the real page: the server detected you as a crawler and is redirecting you to a CAPTCHA page.
Therefore, something is missing from @KunduK's answer; the problem has nothing to do with the find function yet.
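A rough way to automate this check is sketched below. It assumes the block page always carries a meta refresh tag whose target mentions "captcha", as in the output above; the function name looks_blocked and the shortened requestId value are made up for illustration, and other sites will use different markers.

```python
from bs4 import BeautifulSoup


def looks_blocked(html):
    """Return True if the HTML resembles the CAPTCHA redirect page shown above."""
    soup = BeautifulSoup(html, 'html.parser')
    for meta in soup.find_all('meta'):
        http_equiv = (meta.get('http-equiv') or '').lower()
        content = (meta.get('content') or '').lower()
        # A refresh that points at a captcha URL is the giveaway in this case.
        if http_equiv == 'refresh' and 'captcha' in content:
            return True
    return False


# Shortened, hypothetical samples for demonstration.
blocked_sample = '<meta content="10; url=/distil_r_captcha.html?requestId=abc" http-equiv="refresh"/>'
normal_sample = '<dl class="data payments"><dt>賃料:</dt></dl>'

print(looks_blocked(blocked_sample))
print(looks_blocked(normal_sample))
```

Running a check like this before parsing makes it obvious whether a missing element is a parsing problem or a blocking problem.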
Main Part
First of all, you need to make your Python script look less like a crawler.
Headers
Headers are the signal most commonly used to detect a crawler.
With plain requests, you can check the default headers of a session like this:
>>> s = requests.session()
>>> print(s.headers)
{'User-Agent': 'python-requests/2.22.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
You can see that these headers tell the server you are a crawler program, namely python-requests/2.22.0.
Therefore, you need to change the User-Agent by updating the headers.
s = requests.session()
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
s.headers.update(headers)
However, when I tested the crawler with only this change, it was still detected. So we need to dig further into the headers. (There could also be other causes, such as IP blocking or cookies; I will mention those later.)
In Chrome, open the Developer Tools and then open the website. (To simulate a first visit, clear the cookies first.) After clearing the cookies, refresh the page. The Network tab of the Developer Tools now shows many requests made by Chrome.
Click the first entry, https://www.athome.co.jp/, and a detailed panel appears on the right. Its Request Headers section lists the headers Chrome generated and sent to the target site's server.
To be safe, you can copy every header from Chrome into your crawler, and the site can then no longer tell from the headers whether you are a real Chrome or a crawler. (This works for most sites, but I have also found sites with unusual setups that require a special header in every request.)
I have already worked out that once accept-language is added, this website's anti-crawler check lets you through.
So, all together, you need to update your headers like this:
headers = {
'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
s.headers.update(headers)
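If you want to confirm what the session will actually send before touching the site, one option is to prepare a request without sending it. This is only a sketch: it relies on Session.prepare_request from requests, and no network traffic is involved.

```python
import requests

s = requests.session()
s.headers.update({
    'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7',
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
})

# Build the request without sending it; session headers are merged in here.
prepared = s.prepare_request(requests.Request('GET', 'https://www.athome.co.jp/'))

for name, value in prepared.headers.items():
    print(name, ':', value)
```

The printed User-Agent should now be the browser string rather than python-requests.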
Cookie
For an explanation of cookies, you can refer to the wiki.
There is an easy way to obtain the cookie.
First, initialize a session and update the headers, as described above.
Second, request the page https://www.athome.co.jp; once you get the page, you will have a cookie issued by the server.
s.get(url='https://www.athome.co.jp')
The advantage of requests.session is that the session maintains the cookies for you, so your next request will send this cookie automatically.
You can check the cookie you obtained like this:
print(s.cookies)
And my result is:
<RequestsCookieJar[Cookie(version=0, name='athome_lab', value='ffba98ff.592d4d027d28b', port=None, port_specified=False, domain='www.athome.co.jp', domain_specified=False, domain_initial_dot=False, path='/', path_specified=True, secure=False, expires=1884177606, discard=False, comment=None, comment_url=None, rest={}, rfc2109=False)]>
You do not need to parse this page, because you just want the cookie rather than the content.
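The sketch below illustrates the "sent automatically" part without contacting the server. The cookie is set by hand with a made-up value as a stand-in for the one the server would issue, and the request is only prepared, not sent.

```python
import requests

s = requests.session()

# Stand-in for the cookie the server would set; the value here is made up.
s.cookies.set('athome_lab', 'example-value', domain='www.athome.co.jp', path='/')

# Prepare (but do not send) a follow-up request to the same site.
prepared = s.prepare_request(
    requests.Request('GET', 'https://www.athome.co.jp/chintai/1001303243/')
)

# The session attached the stored cookie to the outgoing request.
print(prepared.headers.get('Cookie'))
```

With a real session, the s.get call to the homepage fills the cookie jar, and every later s.get on the same domain carries the cookie in the same way.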
To get the content
Now use the session you set up to request the wiki page you mentioned.
wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
page = s.get(wiki)
Now the server sends you everything you want, and you can parse it with BeautifulSoup.
from bs4 import BeautifulSoup
soup = BeautifulSoup(page.content, 'html.parser')
After getting the content you want, you can use BeautifulSoup to get the target element.
soup.find('dl', attrs={'class': 'data payments'})
And what you will get is:
<dl class="data payments">
<dt>賃料:</dt>
<dd><span class="num">7.3万円</span></dd>
</dl>
Then you can extract the information you want from it.
target_content = soup.find('dl', attrs={'class': 'data payments'})
dt = target_content.find('dt').get_text()
dd = target_content.find('dd').get_text()
Then format it as a single line:
print('payment: {dt} is {dd}'.format(dt=dt[:-1], dd=dd))
Everything has been done.
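One caveat with the extraction above: if the request was blocked, soup.find returns None and the next line raises an AttributeError. A slightly more defensive version might look like the sketch below; the function name extract_payment is illustrative, and the sample HTML is the fragment shown earlier rather than a live response.

```python
from bs4 import BeautifulSoup


def extract_payment(html):
    """Return 'payment: <label> is <price>', or None if the block is missing."""
    soup = BeautifulSoup(html, 'html.parser')
    target = soup.find('dl', attrs={'class': 'data payments'})
    if target is None:
        # Most likely the block page was returned instead of the listing.
        return None
    dt = target.find('dt')
    dd = target.find('dd')
    if dt is None or dd is None:
        return None
    # Strip a trailing ASCII or full-width colon from the label.
    label = dt.get_text(strip=True).rstrip(':：')
    price = dd.get_text(strip=True)
    return 'payment: {} is {}'.format(label, price)


sample = '''
<dl class="data payments">
<dt>賃料:</dt>
<dd><span class="num">7.3万円</span></dd>
</dl>
'''
print(extract_payment(sample))
print(extract_payment('<html><body>blocked</body></html>'))
```

Returning None instead of raising lets the caller decide whether to retry, log the page, or stop.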
Summary
I will paste the code below.
# Import packages you want.
import requests
from bs4 import BeautifulSoup
# Initiate a session and update the headers.
s = requests.session()
headers = {
'accept-language': 'zh-CN,zh;q=0.9,en;q=0.8,zh-TW;q=0.7',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36',
}
s.headers.update(headers)
# Get the homepage of the website and get cookies.
s.get(url='https://www.athome.co.jp')
"""
# You might need to use the following part to check if you have successfully obtained the cookies.
# If not, you might be blocked by the anti-crawler.
print(s.cookies)
"""
# Get the content from the page.
wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
page = s.get(wiki)
# Parse the webpage for getting the elements.
soup = BeautifulSoup(page.content, 'html.parser')
target_content = soup.find('dl', attrs={'class': 'data payments'})
dt = target_content.find('dt').get_text()
dd = target_content.find('dd').get_text()
# Print the result.
print('payment: {dt} is {dd}'.format(dt=dt[:-1], dd=dd))
There is a long way to go in the crawler field.
You had better get an outline of it and make full use of the browser's Developer Tools.
You might need to find out whether the content is loaded by JavaScript, or whether it is inside an iframe.
What's more, you might be detected as a crawler and blocked by the server. Anti-anti-crawler techniques can only be learned through practice.
I suggest you start with an easier website that has no anti-crawler function.
Try the code below. Use the class name together with the tag to find the element.
from bs4 import BeautifulSoup
import requests
wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
headers = requests.utils.default_headers()
headers.update({
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
page = requests.get(wiki,headers=headers)
soup = BeautifulSoup(page.content, 'lxml')
for i in soup.find_all("dl", class_="data payments"):
    print(i.find('dt').text)
    print(i.find('span').text)
Output:
賃料:
7.3万円
If you want to format the output as you expected, try this:
from bs4 import BeautifulSoup
import requests
wiki = "https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono"
headers = requests.utils.default_headers()
headers.update({
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
page = requests.get(wiki,headers=headers)
soup = BeautifulSoup(page.content, 'lxml')
for i in soup.find_all("dl", class_="data payments"):
    print("Payment: " + i.find('dt').text.split(':')[0] + " is " + i.find('span').text)
Output:
Payment: 賃料 is 7.3万円
The problem is that the site is blocking your requests because it identifies them as coming from a bot.
The usual trick to get around that is to attach the same headers (including cookies) that your browser sends with the request. You can see all the headers Chrome sends via Inspect > Network > Request > Copy > Copy as cURL.
When you run your script, you get the following:
You reached this page when attempting to access https://www.athome.co.jp/chintai/1001303243/?DOWN=2&BKLISTID=002LPC&sref=list_simple&bi=tatemono from 152.172.223.133 on 2019-09-18 02:21:34 UTC.
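Once you have copied the request headers out of the browser, turning them into a dict for requests is mechanical. The helper below is a small sketch for the plain "name: value" lines shown in the Network tab; parse_raw_headers is a made-up name, and the sample text is illustrative rather than a real capture.

```python
def parse_raw_headers(raw):
    """Convert 'name: value' lines copied from DevTools into a dict."""
    headers = {}
    for line in raw.splitlines():
        line = line.strip()
        # Skip blanks, malformed lines, and HTTP/2 pseudo-headers like ':authority'.
        if not line or line.startswith(':') or ':' not in line:
            continue
        name, _, value = line.partition(':')  # split on the first colon only
        headers[name.strip()] = value.strip()
    return headers


raw = """
:authority: www.athome.co.jp
accept-language: en-US,en;q=0.9
user-agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0
referer: https://www.athome.co.jp/
"""

headers = parse_raw_headers(raw)
for name, value in headers.items():
    print(name, '->', value)
```

The resulting dict can be passed as headers=headers to requests.get, or merged into a session with s.headers.update(headers).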
The code below extracts data from Zillow Sale.
My first question is where people get the header information.
My second question is how I know when I need headers. For some other pages, like Cars.com, I don't need to pass headers=headers and I can still get the data correctly.
Thank you for your help.
HHC
import requests
from bs4 import BeautifulSoup
import re
url ='https://www.zillow.com/baltimore-md-21201/?searchQueryState=%7B%22pagination%22%3A%7B%7D%2C%22usersSearchTerm%22%3A%2221201%22%2C%22mapBounds%22%3A%7B%22west%22%3A-76.67377295275878%2C%22east%22%3A-76.5733510472412%2C%22south%22%3A39.26716345016057%2C%22north%22%3A39.32309233550334%7D%2C%22regionSelection%22%3A%5B%7B%22regionId%22%3A66811%2C%22regionType%22%3A7%7D%5D%2C%22isMapVisible%22%3Atrue%2C%22filterState%22%3A%7B%22ah%22%3A%7B%22value%22%3Atrue%7D%7D%2C%22isListVisible%22%3Atrue%2C%22mapZoom%22%3A14%7D'
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    'referer': 'https://www.zillow.com/new-york-ny/rentals/2_p/?searchQueryState=%7B%22pagination',
}
raw_page = requests.get(url, headers=headers)
status = raw_page.status_code
print(status)
# Loading the page content into the beautiful soup
page = raw_page.content
page_soup = BeautifulSoup(page, 'html.parser')
print(page_soup)
You can get the headers by visiting the site in your browser and opening the Network tab of the developer tools; select a request and you can see the headers sent with it.
Some websites don't serve bots. To make them think you're not a bot, set the User-Agent header to one a browser uses; some sites may require more headers before you pass the not-a-bot test. You can see all the headers being sent in the developer tools, and you can test different headers until your request succeeds.
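The "test different headers until your request succeeds" idea can be sketched as a small loop. Everything here is illustrative: first_working_headers is a made-up helper, and fake_fetch is a stand-in for a real request so the example runs without network access. Note also that a 200 status alone does not prove you were not served a block page, so in practice you would check the body too.

```python
def first_working_headers(candidates, fetch):
    """Try header dicts in order; return the first one that yields HTTP 200."""
    for headers in candidates:
        if fetch(headers) == 200:
            return headers
    return None


UA = 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36'

# Start small and add headers one step at a time.
candidates = [
    {},
    {'user-agent': UA},
    {'user-agent': UA, 'accept-language': 'en-US,en;q=0.9'},
]


def fake_fetch(headers):
    # Hypothetical server behaviour: only accepts requests with both headers.
    ok = 'user-agent' in headers and 'accept-language' in headers
    return 200 if ok else 403


print(first_working_headers(candidates, fake_fetch))

# With the real site you would pass something like:
#   fetch = lambda h: requests.get(url, headers=h).status_code
```

Ordering the candidates from fewest to most headers tells you the smallest set the site accepts.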
In your browser, go to http://myhttpheader.com/ and you will find your header information there.
Secondly, you only need to provide headers when a website such as Zillow blocks you from scraping its data.
I found the question "How would I log into Instagram using BeautifulSoup4 and Requests, and how would I determine it on my own?", but this code
import re
import requests
from bs4 import BeautifulSoup
from datetime import datetime
link = 'https://www.instagram.com/accounts/login/'
login_url = 'https://www.instagram.com/accounts/login/ajax/'
time = int(datetime.now().timestamp())
payload = {
'username': 'login',
'enc_password': f'#PWD_INSTAGRAM_BROWSER:0:{time}:your_password',
'queryParams': {},
'optIntoOneTap': 'false'
}
with requests.Session() as s:
    r = s.get(link)
    csrf = re.findall(r"csrf_token\":\"(.*?)\"", r.text)[0]
    r = s.post(login_url, data=payload, headers={
        "User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36",
        "X-Requested-With": "XMLHttpRequest",
        "Referer": "https://www.instagram.com/accounts/login/",
        "x-csrftoken": csrf
    })
    print(r.status_code)
gives me an error with the CSRF token:
line 21, in <module>
csrf = re.findall(r"csrf_token\":\"(.*?)\"", r.text)[0]
IndexError: list index out of range
The other posts on Stack Overflow don't work for me.
I don't want to use Selenium.
TL;DR
Add a user-agent to your get request header on line 20:
r = s.get(link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3 rv:3.0; sl-SI) AppleWebKit/533.38.2 (KHTML, like Gecko) Version/5.0 Safari/533.38.2'})
Long answer
If we look at the error message you posted, we can start to dissect what's gone wrong. Line 21 is attempting to find a csrf_token attribute on the Instagram login page.
Diagnostics
We can see from the error message that the list index is out of range, which in this case means that the list returned by re.findall (docs) is empty. This means that one of the following is true:
Your regex is wrong
The HTML returned by your get request (docs) r = s.get(link) on line 20 doesn't contain a csrf_token attribute
The attribute doesn't exist in the source HTML
If we visit the page and look at its HTML source, we can see that a csrf_token attribute is indeed present on line 261:
<script type="text/javascript">window._sharedData = {"config":{"csrf_token":"TOKEN HERE","viewer":null,"viewerId":null}}</script>
Note: I have omitted the rest of the code for brevity.
Now that we know it's present on the page, we can write the scraped html that you're receiving via your get request to a local file and inspect it:
r = s.get(link)
# Save the HTML that requests actually received so it can be inspected.
with open("csrf.html", "w", encoding="utf-8") as f:
    f.write(r.text)
If you open that file and search (Ctrl+F) for csrf_token, you will find it is not present. This likely means that Instagram detected that you're accessing the page via a scraper and returned a modified version of the page.
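The same check can be done in code, and it also avoids the IndexError. The sketch below uses re.search, which returns None when nothing matches, instead of indexing into the re.findall result; find_csrf_token is an illustrative name, and the sample string is the script tag quoted above.

```python
import re


def find_csrf_token(html):
    """Return the csrf_token value from the page source, or None if absent."""
    match = re.search(r'"csrf_token":"(.*?)"', html)
    return match.group(1) if match else None


sample = (
    '<script type="text/javascript">window._sharedData = '
    '{"config":{"csrf_token":"TOKEN HERE","viewer":null,"viewerId":null}}</script>'
)

print(find_csrf_token(sample))
print(find_csrf_token('<html><body>modified page without the token</body></html>'))
```

A None result is the signal that the page you received is not the one your browser sees.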
The fix
To fix this, you need to add a user-agent to your request headers, which essentially 'tricks' the site into thinking you're accessing it via a browser. This can be done by changing:
r = s.get(link)
to something like this:
r = s.get(link, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; U; PPC Mac OS X 10_7_3 rv:3.0; sl-SI) AppleWebKit/533.38.2 (KHTML, like Gecko) Version/5.0 Safari/533.38.2'})
Note: this is a randomly chosen user agent string.
Notes
I appreciate that you don't want to use Selenium for your task, but you might find that the more dynamic the interactions you need, the harder they are to achieve with static scraping libraries like the requests module. Here are some good resources for learning Selenium in Python:
Selenium docs
Python Selenium Tutorial #1 - Web Scraping, Bots & Testing
I'm not sure if there's an API for this, but I'm trying to scrape the prices of certain products from Wayfair.
I wrote some Python code using Beautiful Soup and requests, but the HTML I get back says "Our systems have detected unusual traffic from your computer network."
Is there anything I can do to make this work?
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'
}
def fetch_wayfair_price(url):
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.content)
    print(soup)

fetch_wayfair_price('https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
The Wayfair site loads a lot of data after the initial page, so fetching the HTML from the URL you provided probably won't give you everything you need. That said, I was able to get the request to work.
Start by using a session from the requests library; this will track cookies across requests. I also added upgrade-insecure-requests: '1' to the headers so that you can avoid some of the issues that HTTPS introduces to scraping. The request now returns the same response as the browser request does, but the information you want is probably loaded in subsequent requests made by the browser/JS.
import requests
from bs4 import BeautifulSoup
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
'upgrade-insecure-requests': '1',
'dnt': '1'
}
def fetch_wayfair_price(url):
    with requests.Session() as session:
        post = session.get(url, headers=headers)
        soup = BeautifulSoup(post.text, 'html.parser')
        print(soup)

fetch_wayfair_price(
    'https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
Note: The session that the requests library creates really should persist outside of the fetch_wayfair_price function. I have kept it inside the function to match your example.
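For reference, one way to let the session outlive a single call is to create it once and pass it in. This is only a sketch of that structure; make_session and fetch_soup are illustrative names, and the actual network call is left in a comment so the example does not hit the site when run.

```python
import requests
from bs4 import BeautifulSoup

HEADERS = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36',
    'upgrade-insecure-requests': '1',
    'dnt': '1',
}


def make_session():
    """Create one session whose headers and cookies are reused for every request."""
    session = requests.Session()
    session.headers.update(HEADERS)
    return session


def fetch_soup(session, url):
    """Fetch a page with the shared session and return the parsed HTML."""
    response = session.get(url)
    return BeautifulSoup(response.text, 'html.parser')


session = make_session()
print(session.headers['User-Agent'])

# Reuse the same session for every product page, for example:
#   soup = fetch_soup(session, 'https://www.wayfair.ca/furniture/pdp/mistana-katarina-5-piece-extendable-solid-wood-dining-set-mitn2584.html')
```

Because cookies set by the first response stay in the session, later requests look more like a continuing browser visit.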
I use the code below to scrape results from Bing, but when I look at the scraped page it says "There are no results for python".
When I search in the browser, there is no problem.
import requests
from bs4 import BeautifulSoup
term = 'python'
url = f'https://www.bing.com/search?q={term}&setlang=en-us'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
I searched and didn't find any similar problem.
You need to pass a User-Agent header with the request to get the results.
import requests
from bs4 import BeautifulSoup
term = 'python'
url = 'https://www.bing.com/search?q={}&setlang=en-us'.format(term)
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
Since Bing is a dynamic website, meaning JavaScript generates the page content, you won't be able to scrape it using only BeautifulSoup. Instead, I recommend Selenium, which opens a browser that you can control; you can then parse the resulting HTML with BeautifulSoup.
The same applies to any other dynamically generated website, including Google and many others.
It's probably because no user-agent is being passed in the request headers (as already mentioned by KunduK). When no user-agent is specified, the requests library defaults to python-requests, so Bing and other search engines recognize the request as coming from a bot/script and block it. Check what your user-agent is.
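A quick way to check this locally, without sending anything, is to ask requests for its default. This small sketch uses requests.utils.default_user_agent and the default headers of a fresh session; the exact version number printed depends on your installed requests.

```python
import requests

# The User-Agent that requests sends when you do not set one yourself.
default_ua = requests.utils.default_user_agent()
print(default_ua)

# A new session carries the same default until you override it.
session_ua = requests.Session().headers['User-Agent']
print(session_ua)
```

Both lines start with python-requests/, which is exactly the string that servers look for.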
Pass user-agent:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
See also: how to reduce the chance of being blocked while web scraping search engines.
Alternatively, you can achieve the same thing by using Bing Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to spend time trying to bypass blocks from Bing or other search engines. Instead, focus on the data that needs to be extracted from the structured JSON. Check out the playground.
Disclaimer, I work for SerpApi.
I'm trying to scrape Google search results, but all I get as output is an empty list. Do you have any idea what's wrong here? I found a similar post on Stack Overflow where the solution says to set a user agent. I tried that, but it still returns nothing. Please share any ideas you have.
import requests, webbrowser
from bs4 import BeautifulSoup
user_input = input("Enter something to search:")
print("googling.....")
google_search = requests.get("https://www.google.com/search?q="+user_input)
# print(google_search.text)
soup = BeautifulSoup(google_search.text , 'html.parser')
# print(soup.prettify())
search_results = soup.select('.r a')
# print(search_results)
for link in search_results[:5]:
    actual_link = link.get('href')
    print(actual_link)
    webbrowser.open('https://google.com/'+actual_link)
Google blocks your requests and returns this error: "This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the Terms of Service. The block will expire shortly after those requests stop. In the meantime, solving the above CAPTCHA will let you continue to use our services. This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests. If you share your network connection, ask your administrator for help; a different computer using the same IP address may be responsible. Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly."
Try using Selenium with Python to get all the links.
To get results from the Google page, you have to specify the User-Agent HTTP header. For English results, add the hl=en parameter to the search URL:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
user_input = input("Enter something to search: ")
print("googling.....")
google_search = requests.get("https://www.google.com/search?hl=en&q="+user_input, headers=headers) # <-- add headers and hl=en parameter
soup = BeautifulSoup(google_search.text , 'html.parser')
search_results = soup.select('.r a')
for link in search_results:
    actual_link = link.get('href')
    print(actual_link)
Prints:
Enter something to search: tree
googling.....
https://en.wikipedia.org/wiki/Tree
#
https://webcache.googleusercontent.com/search?q=cache:wHCoEH9G9w8J:https://en.wikipedia.org/wiki/Tree+&cd=22&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://en.wikipedia.org/wiki/Tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAVegQIAxAH
https://simple.wikipedia.org/wiki/Tree
#
https://webcache.googleusercontent.com/search?q=cache:tNzOpY417g8J:https://simple.wikipedia.org/wiki/Tree+&cd=23&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://simple.wikipedia.org/wiki/Tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAWegQIARAH
https://www.britannica.com/plant/tree
#
https://webcache.googleusercontent.com/search?q=cache:91hg5d2649QJ:https://www.britannica.com/plant/tree+&cd=24&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://www.britannica.com/plant/tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAXegQIAhAJ
https://www.knowablemagazine.org/article/living-world/2018/what-makes-tree-tree
#
https://webcache.googleusercontent.com/search?q=cache:AVSszZLtPiQJ:https://www.knowablemagazine.org/article/living-world/2018/what-makes-tree-tree+&cd=25&hl=en&ct=clnk&gl=sk
https://teamtrees.org/
#
https://webcache.googleusercontent.com/search?q=cache:gVbpYoK7meUJ:https://teamtrees.org/+&cd=26&hl=en&ct=clnk&gl=sk
https://www.ldoceonline.com/dictionary/tree
#
https://webcache.googleusercontent.com/search?q=cache:oyS4e3WdMX8J:https://www.ldoceonline.com/dictionary/tree+&cd=27&hl=en&ct=clnk&gl=sk
https://en.wiktionary.org/wiki/tree
#
https://webcache.googleusercontent.com/search?q=cache:s_tZIjpvHZIJ:https://en.wiktionary.org/wiki/tree+&cd=28&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://en.wiktionary.org/wiki/tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAbegQICBAH
https://www.dictionary.com/browse/tree
#
https://webcache.googleusercontent.com/search?q=cache:EhFIP6m4MuIJ:https://www.dictionary.com/browse/tree+&cd=29&hl=en&ct=clnk&gl=sk
https://www.treepeople.org/tree-benefits
#
https://webcache.googleusercontent.com/search?q=cache:4wLYFp4zTuUJ:https://www.treepeople.org/tree-benefits+&cd=30&hl=en&ct=clnk&gl=sk
EDIT: To filter results you can use this:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
user_input = input("Enter something to search: ")
print("googling.....")
google_search = requests.get("https://www.google.com/search?hl=en&q="+user_input, headers=headers) # <-- add headers and hl=en parameter
soup = BeautifulSoup(google_search.text , 'html.parser')
search_results = soup.select('.r a')
for link in search_results:
    actual_link = link.get('href')
    if actual_link.startswith('#') or \
       actual_link.startswith('https://webcache.googleusercontent.com') or \
       actual_link.startswith('/search?'):
        continue
    print(actual_link)
Prints (for example):
Enter something to search: tree
googling.....
https://en.wikipedia.org/wiki/Tree
https://simple.wikipedia.org/wiki/Tree
https://www.britannica.com/plant/tree
https://www.knowablemagazine.org/article/living-world/2018/what-makes-tree-tree
https://teamtrees.org/
https://www.ldoceonline.com/dictionary/tree
https://en.wiktionary.org/wiki/tree
https://www.dictionary.com/browse/tree
https://www.treepeople.org/tree-benefits
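The filter can also be written more compactly, since str.startswith accepts a tuple of prefixes. The sketch below is an equivalent variant with a guard for anchors that have no href; keep_link and the sample list are illustrative, with the links taken from the unfiltered output shown earlier.

```python
# Prefixes of links that are not real search results.
SKIP_PREFIXES = ('#', 'https://webcache.googleusercontent.com', '/search?')


def keep_link(href):
    """Return True for hrefs that look like organic result links."""
    # link.get('href') returns None when the attribute is missing.
    return bool(href) and not href.startswith(SKIP_PREFIXES)


sample_links = [
    'https://en.wikipedia.org/wiki/Tree',
    '#',
    'https://webcache.googleusercontent.com/search?q=cache:wHCoEH9G9w8J:https://en.wikipedia.org/wiki/Tree+&cd=22&hl=en&ct=clnk&gl=sk',
    '/search?hl=en&q=related:https://en.wikipedia.org/wiki/Tree+tree&tbo=1',
    None,
    'https://www.britannica.com/plant/tree',
]

filtered = [href for href in sample_links if keep_link(href)]
for href in filtered:
    print(href)
```

Keeping the prefixes in one tuple makes it easy to add more patterns later.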
Most websites nowadays use JavaScript to load their pages dynamically, and Google is one of them. For the full DOM (Document Object Model) to load, you need a JavaScript engine, which BeautifulSoup and requests don't have. Arun recommended Selenium, and I do too, as it drives a browser that has a JavaScript engine.
Here is the Python Selenium documentation:
https://selenium-python.readthedocs.io/
The output the OP wants doesn't come from JavaScript, contrary to what Serket mentioned. All the data the OP needs is located in the HTML.
For the same reason there's no point in using Selenium: it's all there in the HTML, not rendered via JavaScript.
One of the problems, as other people mentioned, is that no user-agent was specified. It is also possible to pass a wrong user-agent, which leads to completely different HTML containing an error message or something similar. Check what your user-agent is.
Pass user-agent:
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get(YOUR_URL, headers=headers)
You can also grab attributes by passing them in square brackets:
element.get('href')
# is equivalent to
element['href']
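The two forms only differ when the attribute is missing: get returns None, while square brackets raise a KeyError. A minimal sketch with made-up HTML:

```python
from bs4 import BeautifulSoup

# Two anchors: one with an href attribute and one without.
soup = BeautifulSoup('<a href="https://example.com">with</a><a>without</a>', 'html.parser')
with_href, without_href = soup.find_all('a')

# Both forms return the same value when the attribute exists.
print(with_href.get('href'), with_href['href'])

# .get() returns None for a missing attribute...
print(without_href.get('href'))

# ...while square brackets raise KeyError.
try:
    without_href['href']
except KeyError:
    print('KeyError: no href on this tag')
```

So get is the safer choice when looping over many links, and square brackets are fine when you know the attribute is there.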
Code and example in the online IDE (CSS selectors reference):
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "fus ro dah" # query
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
# container with links and iterate over it
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
-------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.etsy.com/market/fus_ro_dah
https://www.nexusmods.com/skyrimspecialedition/mods/4889/
https://www.textualtees.com/products/fus-ro-dah-t-shirt
'''
Alternatively, you can achieve the same thing by using Google Search Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't need to figure out why or how to deal with such a problem since this part (extraction/scraping) is already done for the end-user. All that needs to be done is just to iterate over structured JSON and get what you want.
Code:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "fus ro day",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(result['link'])
---------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.etsy.com/market/fus_ro_dah
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.textualtees.com/products/fus-ro-dah-t-shirt
https://tenor.com/search/fus-ro-dah-gifs
'''
P.S - I have a blog post that covers a bit more in-depth how to scrape Google Organic Search Results.
Disclaimer, I work for SerpApi.