I am trying to scrape Google search results using the following code. I want to take the title and the URL from the first page of results and then continue by scraping the next pages of the search results too.
This is a sample of code that I just started writing:
from urllib.request import urlopen as uReq
import urllib.request
from bs4 import BeautifulSoup as soup
paging_url = "https://www.google.gr/search?q=donald+trump&ei=F91FW8XBGYjJsQHQwaWADA&start=110&sa=N&biw=811&bih=662"
req = urllib.request.Request("https://www.google.gr/search?q=donald+trump&ei=F91FW8XBGYjJsQHQwaWADA&start=110&sa=N&biw=811&bih=662", headers={'User-Agent': "Magic Browser"})
UClient = uReq(req) # downloading the url
page_html = UClient.read()
UClient.close()
page_soup = soup(page_html, "html.parser")
I noticed that all Google results share a common class named "g", so I wrote the following command:
results= page_soup.findAll("div",{"class":"g"})
But after testing, the results returned are not the same as the ones I see when I visit the initial URL in a browser.
Moreover, some div tags, such as:
<div data-hveid="38" data-ved="0ahUKEwjGp7XEj5fcAhXMDZoKHRf8DJMQFQgmKAAwAA">
and
<div class="rc">
cannot be seen in the tree that BeautifulSoup produces, meaning I cannot use the findAll function to locate objects inside those tags because BeautifulSoup acts as if they do not exist.
Why does all this happen?
I would never scrape Google directly via raw HTTP requests; Google can detect it very easily. To avoid detection, I suggest using an automated browser such as Chrome driven by Selenium.
In your example, the problem is that Google delivers a different HTML version of its SERP page because it detects the low-level HTTP scraping.
There are open-source libraries that handle all the difficult parts of scraping. For example GoogleScraper, a tool written in Python 3 that supports three different scraping modes: raw HTTP scraping, Selenium mode (with real browsers), and asynchronous HTTP mode.
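As a minimal sketch of the Selenium approach (not GoogleScraper itself), assuming Chrome and a matching chromedriver are available and that Google's current div.g / h3 markup hasn't changed (the selectors are assumptions to verify in devtools), something like this collects titles and links:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # assumes chromedriver is reachable (PATH or Selenium Manager)
driver.get("https://www.google.gr/search?q=donald+trump")

for result in driver.find_elements(By.CSS_SELECTOR, "div.g"):
    try:
        title = result.find_element(By.CSS_SELECTOR, "h3").text
        link = result.find_element(By.CSS_SELECTOR, "a").get_attribute("href")
        print(title, link)
    except Exception:
        continue  # some "g" containers are not organic results and lack h3/a

driver.quit()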
To paginate through all Google Search pages, you can use a while True loop to scrape the first and all subsequent pages.
Pagination continues as long as the "next" button exists (determined by the presence of its selector on the page, in our case the CSS selector .d6cvqb a[id=pnnext]); increase the value of the "start" parameter by 10 to move to the next page if the button is present, otherwise break out of the while loop:
if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
Make sure you pass a user-agent in the request headers to act like a "real" user visit; the default requests user-agent is python-requests, and websites understand that it's most likely a script sending the request. Check what your user-agent is.
Check the full code using pagination in the online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "donald trump",  # query
    "hl": "en",           # language
    "gl": "us",           # country of the search, US -> USA
    "start": 0,           # results offset; 0 = first page
    # "filter": 0         # shows more than 10 pages; by default only ~10-13 pages are shown if filter = 1
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
}
page_num = 0
website_data = []
while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        title = f'Title: {result.select_one("h3").text}'
        link = f'Link: {result.select_one("a")["href"]}'

        website_data.append({
            "title": title,
            "link": link,
        })

    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break
print(json.dumps(website_data, indent=2, ensure_ascii=False))
Example output:
[
{
"title": "Title: Donald J. Trump: Home",
"link": "Link: https://www.donaldjtrump.com/"
},
{
"title": "Title: Donald Trump - Wikipedia",
"link": "Link: https://en.wikipedia.org/wiki/Donald_Trump"
},
{
"title": "Title: Donald J. Trump | The White House",
"link": "Link: https://www.whitehouse.gov/about-the-white-house/presidents/donald-j-trump/"
},
{
"title": "Title: Donald Trump - CNBC",
"link": "Link: https://www.cnbc.com/donald-trump/"
},
{
"title": "Title: Donald Trump | The Hill | Page 1",
"link": "Link: https://thehill.com/people/donald-trump/"
},
# ...
]
Alternatively, you can use Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that it bypasses blocks (including CAPTCHA) from Google, so there's no need to create the parser and maintain it.
Code example:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os
params = {
    "api_key": os.getenv("API_KEY"),  # serpapi key
    "engine": "google",               # serpapi parser engine
    "q": "donald trump",              # search query
    "num": "100"                      # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)  # where data extraction happens

organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()  # JSON -> Python dictionary

    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "page_num": page_num,
            "title": result.get("title"),
            "link": result.get("link"),
            "displayed_link": result.get("displayed_link"),
        })

    if "next_link" in results.get("serpapi_pagination", {}):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break
print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
Output:
[
{
"page_num": 1,
"title": "Donald J. Trump: Home",
"link": "https://www.donaldjtrump.com/",
"displayed_link": "https://www.donaldjtrump.com"
},
{
"page_num": 1,
"title": "Donald Trump - Wikipedia",
"link": "https://en.wikipedia.org/wiki/Donald_Trump",
"displayed_link": "https://en.wikipedia.org › wiki › Donald_Trump"
},
{
"page_num": 1,
"title": "Donald J. Trump | The White House",
"link": "https://www.whitehouse.gov/about-the-white-house/presidents/donald-j-trump/",
"displayed_link": "https://www.whitehouse.gov › ... › Presidents"
},
{
"page_num": 1,
"title": "Donald Trump - CNBC",
"link": "https://www.cnbc.com/donald-trump/",
"displayed_link": "https://www.cnbc.com › donald-trump"
},
# ...
]
Disclaimer, I work for SerpApi.
I am completely new to Web Scraping using Python BeautifulSoup. Using SO and other blogs I have tried to build my first piece of code.
Objective: Given a string as input, the code will search in https://www.google.com and get the search results with the following information:
Title
Brief Description
Link
Say I want to search "Core Banking Solution by Accenture". To do so:
import requests
from bs4 import BeautifulSoup as bs

search_str = 'Core Banking Solution Accenture'
url = 'https://www.google.com/search?q=' + search_str
page = requests.get(url)
soup = bs(page.content, 'lxml')

for node in soup.find_all('a'):
    print("Inner Text : {}".format(node.text))

for h in soup.find_all('h3'):
    print("Title : {}".format(h.text))
    print("Link :{}".format(node.get('href')))
I am getting an error for the title:
AttributeError: 'NoneType' object has no attribute 'text'
Clearly, it is not getting any title object from soup.find_all('a').
Question: What trick do I have to apply here to get the title? Does soup.find_all('a') really contain any title tag? What am I missing here?
Update: Based on the suggestions received, I have updated the code. It is working now, although I still need to check the results in detail.
In this case, node.get("title") returns None, as there is no title attribute in the a tag. This causes an error when accessing the .text attribute. Instead, we can examine the HTML to figure out which part of the tag constitutes the title. See the example below.
from bs4 import BeautifulSoup
html = '''
<a href="https://html.com/attributes/a-title">
<br>
<h3>When To Use A Title In HTML</h3>
<div>
<cite role="text">https://html.com
<span role="text"> › attributes › a-title</span>
</cite>
</div>
</a>
'''
soup = BeautifulSoup(html, 'html.parser')
a_tags = soup.find_all('a')
for tag in a_tags:
    print(tag.get("title"))  # prints None
    print(tag.h3.text)       # prints "When To Use A Title In HTML"
In this search result, the title is found in the contents of the h3-tag. We can extract it using tag.h3.text. The browser devtools are a great resource when trying to figure out how to navigate the tags.
You might receive empty output because you are not using a user-agent to act as a "real" user request.
If the request is being blocked, the response will contain different HTML with different elements/selectors and some sort of error message, and that's why you might get an AttributeError: 'NoneType' object has no attribute 'text' error.
Also, when you don't specify a user-agent in the request headers, the requests library defaults to python-requests, which websites recognize and might block because they understand that it's a script sending the request.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, json, lxml
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "Core Banking Solution Accenture",  # search query
    "gl": "us",                               # country of the search
    "hl": "en"                                # language
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.79 Safari/537.36",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

results = []

for index, result in enumerate(soup.select(".tF2Cxc"), start=1):
    title = result.select_one(".DKV0Md").text
    link = result.select_one(".yuRUbf a")["href"]

    try:
        snippet = result.select_one("#rso .lyLwlc").text
    except:
        snippet = None

    results.append({
        "position": index,
        "title": title,
        "link": link,
        "snippet": snippet
    })
print(json.dumps(results, indent=2, ensure_ascii=False))
Part of the output:
[
{
"position": 1,
"title": "Core Banking Services | Accenture",
"link": "https://www.accenture.com/us-en/services/banking/core-banking",
"snippet": "Accenture brings together the skills, technologies and capabilities to renew core banking systems in ways that put customers first and enable banks to release ..."
},
{
"position": 2,
"title": "Accenture In Consortium Hired to Implement Core Banking ...",
"link": "https://newsroom.accenture.com/industries/banking/accenture-in-consortium-hired-to-implement-core-banking-solution-at-polands-largest-bank.htm",
"snippet": "WARSAW; Aug. 18, 2003 –Accenture has signed a $114 million contract to implement a core-banking platform at PKO BP, Poland's largest bank, as part of a ..."
}, ... other results
]
Alternatively, you can achieve it by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to create the parser from scratch, maintain it, or figure out how to bypass blocks from Google or other search engines and how to scale it.
Code to integrate:
from serpapi import GoogleSearch
import json, os
params = {
    "api_key": "serpapi_key",                # your serpapi api key
    "engine": "google",                      # search engine
    "q": "Core Banking Solution Accenture",  # search query
    "google_domain": "google.com",           # google domain
    "gl": "us",                              # country to search from
    "hl": "en"                               # language
    # other parameters
}

search = GoogleSearch(params)  # where data extraction happens
results = search.get_dict()    # JSON -> Python dictionary

data = []

for result in results["organic_results"]:
    data.append({
        "position": result.get("position"),
        "title": result.get("title"),
        "link": result.get("link"),
        "snippet": result.get("snippet")
    })
print(json.dumps(data, indent=2, ensure_ascii=False))
Part of the output:
[
{
"position": 1,
"title": "Core Banking Services | Accenture",
"link": "https://www.accenture.com/us-en/services/banking/core-banking",
"snippet": "Accenture brings together the skills, technologies and capabilities to renew core banking systems in ways that put customers first and enable banks to release ..."
},
{
"position": 2,
"title": "Accenture In Consortium Hired to Implement Core Banking ...",
"link": "https://newsroom.accenture.com/industries/banking/accenture-in-consortium-hired-to-implement-core-banking-solution-at-polands-largest-bank.htm",
"snippet": "WARSAW; Aug. 18, 2003 –Accenture has signed a $114 million contract to implement a core-banking platform at PKO BP, Poland's largest bank, as part of a ..."
}, ... other results
]
Disclaimer, I work for SerpApi.
I'm a newbie with python.
In PyCharm I wrote this code:
import requests
from bs4 import BeautifulSoup
response = requests.get(f"https://www.google.com/search?q=fitness+wear")
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
Instead of getting the HTML of the search results, what I get is the HTML of Google's consent page.
I use the same code within a script on pythonanywhere.com and it works perfectly. I've tried lots of the solutions I found, but the result is always the same, so now I'm stuck with it.
I think this should work:
import requests
from bs4 import BeautifulSoup
with requests.Session() as s:
    url = "https://www.google.com/search?q=fitness+wear"
    headers = {
        "referer": "https://www.google.com/",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
    }
    s.post(url, headers=headers)
    response = s.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup)
It uses a requests session and a POST request to create any initial cookies (not fully sure on this) and then allows you to scrape.
If you open up a private window in your browser and go to google.com, you should see the same pop-up prompting you for your consent. This is because you don't send any session cookies.
You have different options to tackle this.
One would be sending the cookies you can observe on the website with the request directly like so:
import requests
cookies = {"CONSENT": "YES+shp.gws-20210330-0-RC1.de+FX+412", ...}
resp = requests.get("https://www.google.com/search?q=fitness+wear", cookies=cookies)
The solution @Dimitriy Kruglikov uses is a lot cleaner, though, and using sessions is a good way of having a persistent session with the website.
Google isn't blocking you; you can still extract data from the HTML.
Using cookies isn't very convenient, and using a session with POST and GET requests will generate more traffic.
You can remove this popup by using either the decompose() or the extract() BS4 method:
annoying_popup.decompose() will completely destroy it and its contents. Documentation.
annoying_popup.extract() will leave you with two HTML trees: one rooted at the BeautifulSoup object you used to parse the document, and one rooted at the tag that was extracted. Documentation.
After that, you can scrape everything you need just as if the popup had never been there.
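For example, a minimal sketch (the #popup selector and the toy HTML are hypothetical; inspect the real page to find the consent element's actual id or class):

from bs4 import BeautifulSoup

# Toy HTML standing in for a Google results page with a consent element.
html = '<div id="popup">consent prompt</div><div class="g">real result</div>'
soup = BeautifulSoup(html, "html.parser")

popup = soup.select_one("#popup")  # hypothetical selector for the popup
if popup:
    popup.decompose()              # destroys the tag and its contents in place
    # removed = popup.extract()    # alternative: detaches the tag and returns it

print(soup)  # -> <div class="g">real result</div>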
See this Organic Results extraction I did recently. It scrapes title, summary, and link from Google Search Results.
Alternatively, you can use Google Search Engine Results API from SerpApi. Check out the Playground.
Code and example in online IDE:
from serpapi import GoogleSearch
import os
params = {
    "engine": "google",
    "q": "fus ro dah",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    print(f"Title: {result['title']}\nSnippet: {result['snippet']}\nLink: {result['link']}\n")
Output:
Title: Skyrim - FUS RO DAH (Dovahkiin) HD - YouTube
Snippet: I looked around for a fan made track that included Fus Ro Dah, but the ones that I found were pretty bad - some ...
Link: https://www.youtube.com/watch?v=JblD-FN3tgs
Title: Unrelenting Force (Skyrim) | Elder Scrolls | Fandom
Snippet: If the general subtitles are turned on, it can be seen that the text for the Draugr's Unrelenting Force is misspelled: "Fus Rah Do" instead of the proper "Fus Ro Dah." ...
Link: https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
Title: Fus Ro Dah | Know Your Meme
Snippet: Origin. "Fus Ro Dah" are the words for the "unrelenting force" thu'um shout in the game Elder Scrolls V: Skyrim. After reaching the first town of ...
Link: https://knowyourmeme.com/memes/fus-ro-dah
Title: Fus ro dah - Urban Dictionary
Snippet: 1. A dragon shout used in The Elder Scrolls V: Skyrim. 2.An international term for oral sex given by a female. ex.1. The Dragonborn yelled "Fus ...
Link: https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
Part of JSON:
"organic_results": [
{
"position": 1,
"title": "Unrelenting Force (Skyrim) | Elder Scrolls | Fandom",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)",
"displayed_link": "https://elderscrolls.fandom.com › wiki › Unrelenting_F...",
"snippet": "If the general subtitles are turned on, it can be seen that the text for the Draugr's Unrelenting Force is misspelled: \"Fus Rah Do\" instead of the proper \"Fus Ro Dah.\" ...",
"sitelinks": {
"inline": [
{
"title": "Location",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Location"
},
{
"title": "Effect",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Effect"
},
{
"title": "Usage",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Usage"
},
{
"title": "Word Wall",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Word_Wall"
}
]
},
"cached_page_link": "https://webcache.googleusercontent.com/search?q=cache:K3LEBjvPps0J:https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)+&cd=17&hl=en&ct=clnk&gl=us"
}
]
Disclaimer, I work for SerpApi.
I am trying to scrape the title of every item on an ebay page. This is the page. I first tried to scrape the title of the first listing (lines 5-7 of my code) , and I was successful as the title of the first listing gets printed. But when I try to scrape every single title on the ebay page (lines 8-10), nothing gets printed. Is there a flaw in my logic? Thanks!
1. from bs4 import BeautifulSoup
2. import requests
3. source = requests.get("https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=hippo&_sacat=0").text
4. soup = BeautifulSoup(source, "lxml")
5. listing = soup.find("li", class_=("s-item s-item--watch-at-corner"))
6. title = soup.find("h3", class_=("s-item__title")).text
7. print(title)
8. for listing in soup.find_all("li", class_=("s-item s-item--watch-at-corner")):
9.     title = soup.find("h3", class_=("s-item__title")).text
10.    print(title)
After a quick glance at the docs:
BeautifulSoup's .find_all() method returns a list (as one would expect). However, it seems to me that the .find() in your for loop is just querying the full response again, rather than doing anything with the list items you're generating. I would expect you either to extract the titles manually, such as:
title = listing['some_property']
or perhaps there's another method provided by the library you're using.
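In this particular case the title sits in an h3 inside each listing, so (assuming eBay still uses the s-item__title class shown in the question) the fix would look roughly like:

for listing in soup.find_all("li", class_="s-item s-item--watch-at-corner"):
    h3 = listing.find("h3", class_="s-item__title")  # search inside the listing, not the whole soup
    if h3:  # some listing tiles have no title element
        print(h3.text)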
Looking at your code, you haven't checked the type of the object that find() returns.
from bs4 import BeautifulSoup
import requests
source=requests.get("https://www.ebay.com/sch/i.html_from=R40&_trksid=p2380057.m570.l1313&_nkw=hippo&_sacat=0").text
soup = BeautifulSoup(source, "lxml")
listing = soup.find("li", class_=("s-item s-item--watch-at-corner"))
title = soup.find("h3", class_=("s-item__title")).text
print(type(listing))
This returns the result of
<class 'NoneType'>
So the parsing ends there, as there are no matching li tags to find.
You're calling find("h3", class_=("s-item__title")) on the soup every time; you need to call it on each listing in the loop, or it will always fetch the first title. Also, keep in mind there were a couple of hidden results on the eBay page for whatever reason, so maybe check that out and see whether you want to ignore or include those as well. I added the enumerate function in the loop just to keep track of the number of results.
I used this selector in the Chrome dev tools to find all the listings: li.s-item.s-item--watch-at-corner h3.s-item__title
from bs4 import BeautifulSoup
import requests
source = requests.get("https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=hippo&_sacat=0").text
soup = BeautifulSoup(source, "lxml")
listing = soup.find("li", class_=("s-item s-item--watch-at-corner"))
title = soup.find("h3", class_=("s-item__title")).text
print(title)
for i, listing in enumerate(soup.find_all("li", class_=("s-item s-item--watch-at-corner"))):
    title = listing.find("h3", class_=("s-item__title")).text
    print("[{}] ".format(i) + title)
Result:
[0] Pewter Hippopotamus Hippo Figurine
[1] Hippopotamus Figurine 1.5" Gemstone Opalite Crystal Healing Carved Statue Decor
[2] hippopotamus coffee cafe picture animal hippo art tile gift
[3] NEW! Miniature Bronze Hippo Figurine Miniature Bronze Statue Animal Collectible
[4] Hippopotamus Gzhel porcelain figurine hippo handmade
[5] Hippopotamus Gzhel porcelain figurine hippo souvenir handmade and hand-painted
....
Have a look at the SelectorGadget Chrome extension to easily pick selectors by clicking on the desired element in your browser. It doesn't always work perfectly if the page relies heavily on JS, but in this case it does.
There is also the possibility that the request gets blocked when using requests, since the default user-agent in the requests library is python-requests.
An additional step could be to rotate the user-agent, for example switching between PC, mobile, and tablet, as well as between browsers, e.g. Chrome, Firefox, Safari, Edge, and so on.
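A minimal sketch of such rotation (the user-agent strings are only illustrative examples; the request matches the hippo search used below):

import random
import requests

user_agents = [
    # desktop Chrome
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    # desktop Safari
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
    # mobile Safari
    "Mozilla/5.0 (iPhone; CPU iPhone OS 16_0 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Mobile/15E148 Safari/604.1",
]

headers = {"User-Agent": random.choice(user_agents)}  # pick a different UA on each request
page = requests.get("https://www.ebay.com/sch/i.html", params={"_nkw": "hippo"}, headers=headers, timeout=30)
print(page.status_code)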
Check out the code in the online IDE
from bs4 import BeautifulSoup
import requests, json, lxml
# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
}

params = {
    '_nkw': 'hippo',  # search query
}

data = []

page = requests.get('https://www.ebay.com/sch/i.html', params=params, headers=headers, timeout=30)
soup = BeautifulSoup(page.text, 'lxml')

for title in soup.select(".s-item__title span"):
    if "Shop on eBay" in title:
        pass
    else:
        data.append({"title": title.text})
print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
{
"title": "Hippopotamus Gzhel porcelain figurine hippo souvenir handmade and hand-painted"
},
{
"title": "Coad Peru Signed 1 1/2\" HIPPO Clay Pottery Collectible Figurine"
},
{
"title": "Glass Hippo Hippopotamus figurine \"murano\" handmade"
},
{
"title": "2 Hand Carved Soapstone Brown Hippopotamus Hippo Animal Figurine Paperweight"
},
{
"title": "Schleich Hippo D-73527 Hippopotamus Mouth Open Wildlife Toy Figure 2012"
},
# ...
]
As an alternative, you can use Ebay Organic Results API from SerpApi. It's a paid API with a free plan that handles blocks and parsing on their backend.
Example code:
from serpapi import EbaySearch
import os, json
params = {
    "api_key": os.getenv("API_KEY"),  # serpapi api key
    "engine": "ebay",                 # search engine
    "ebay_domain": "ebay.com",        # ebay domain
    "_nkw": "hippo"                   # search query
}

search = EbaySearch(params)  # where data extraction happens

data = []

results = search.get_dict()  # JSON -> Python dict

for organic_result in results.get("organic_results", []):
    title = organic_result.get("title")

    data.append({
        "title": title
    })
print(json.dumps(data, indent=2))
Output:
[
{
"title": "Schleich Hippo D-73527 Hippopotamus Mouth Open Wildlife Toy Figure 2012"
},
{
"title": "Vintage WOODEN SCULPTURE Hippo HIPPOPOTAMUS"
},
{
"title": "Hippopotamus Gzhel porcelain figurine hippo souvenir handmade and hand-painted"
},
{
"title": "Glass Hippo Hippopotamus figurine \"murano\" handmade"
},
# ...
]
I am trying to scrape the Google knowledge panel to retrieve the name of drugs when it does not appear in the regular Google search results. For instance, if I look for "Buscopan" in Google, the resulting page looks like this:
Now, what I am trying to do with the code shown is take the term "Scopolamina-N-butilbromuro" from the knowledge panel, but I am unable to retrieve it in the HTML code once I inspect the element. To be precise, the code I am implementing, together with the error message, is as follows:
import requests
from bs4 import BeautifulSoup
# URL
url = "https://www.google.com/search?client=safari&rls=en&q="+"buscopan"+"&ie=UTF-8&oe=UTF-8"
# Sending HTTP request
req = requests.get(url)
# Pulling HTTP data from internet
sor = BeautifulSoup(req.text, "html.parser")
temp = sor.find("h2", class_= "qrShPb kno-ecr-pt PZPZlf mfMhoc hNKfZe").text
print(temp)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-39-ef5599a1a1fc> in <module>
13 # Finding temperature in Celsius
14 #temp = sor.find("h2", class_='qrShPb').text
---> 15 temp = sor.find("h2", class_= "qrShPb kno-ecr-pt PZPZlf mfMhoc hNKfZe").text
16
17
AttributeError: 'NoneType' object has no attribute 'text'
I don't know what I am doing wrong. I think the bit of html code I need to look at is the following:
<h2 class="qrShPb kno-ecr-pt PZPZlf mfMhoc hNKfZe" data-local-attribute="d3bn" data-attrid="title" data-ved="2ahUKEwjujfLcgO7rAhWKjosKHSiBAFEQ3B0oATASegQIEBAL"></h2>
Of course, the rest of the HTML code is in the picture reported, but if you need a bigger version, please do not hesitate to ask!
Any suggestions?
Thank you,
Federico
To get the correct result page from a Google search, specify the User-Agent HTTP header. For example:
import requests
from bs4 import BeautifulSoup
params = {
    'q': 'buscopan',  # <-- change to your keyword
    'hl': 'it'        # <-- change to `en` for english results
}
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
url = 'https://www.google.com/search'
soup = BeautifulSoup(requests.get(url, params=params, headers=headers).content, 'html.parser')
print(soup.select_one('h2[data-attrid="title"]').text)
Prints:
Scopolamina-N-butilbromuro
Alternatively to Andrej Kesely's solution, you can use the third-party Google Knowledge Graph API from SerpApi. It's a paid API with a free plan. Check out the Playground to test it.
Code to integrate and full example in the online IDE:
from serpapi import GoogleSearch
import os
params = {
    "q": "Buscopan",
    "google_domain": "google.com",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
title = results['knowledge_graph']['title']
print(title)
Output:
Butylscopolamine
Part of JSON Knowledge Graph output:
"knowledge_graph": {
"title": "Butylscopolamine",
"type": "Medication",
"description": "Hyoscine butylbromide, also known as scopolamine butylbromide and sold under the brandname Buscopan among others, is an anticholinergic medication used to treat crampy abdominal pain, esophageal spasms, renal colic, and bladder spasms. It is also used to improve respiratory secretions at the end of life.",
"source": {
"name": "Wikipedia",
"link": "https://en.wikipedia.org/wiki/Hyoscine_butylbromide"
},
"formula": "C₂₁H₃₀BrNO₄",
"molar_mass": "440.371 g/mol",
"chem_spider_id": "16736107",
"trade_name": "Buscopan, others",
"pub_chem_cid": "6852391",
"ch_ebi_id": "32123",
"people_also_search_for": "Scopolamine, Metamizole, MORE"
}
Disclaimer, I work for SerpApi.
I am trying to scrape google news using the following code:
from bs4 import BeautifulSoup
import requests
import time
from random import randint
def scrape_news_summaries(s):
    time.sleep(randint(0, 2))  # relax and don't let google be angry
    r = requests.get("http://www.google.co.uk/search?q=" + s + "&tbm=nws")
    content = r.text
    news_summaries = []
    soup = BeautifulSoup(content, "html.parser")
    st_divs = soup.findAll("div", {"class": "st"})
    for st_div in st_divs:
        news_summaries.append(st_div.text)
    return news_summaries
l = scrape_news_summaries("T-Notes")
#l = scrape_news_summaries("""T-Notes""")
for n in l:
    print(n)
Even though this bit of code was working before, I now can't figure out why it's not working anymore. Is it possible that I've been banned by Google, given that I only ran the code three or four times? (I tried using Bing News too, unfortunately with the same empty results...)
Thanks.
I tried running the code and it works fine on my computer.
You could try printing the status code for the request, and see if it's anything other than 200.
from bs4 import BeautifulSoup
import requests
import time
from random import randint
def scrape_news_summaries(s):
    time.sleep(randint(0, 2))  # relax and don't let google be angry
    r = requests.get("http://www.google.co.uk/search?q=" + s + "&tbm=nws")
    print(r.status_code)  # print the status code
    content = r.text
    news_summaries = []
    soup = BeautifulSoup(content, "html.parser")
    st_divs = soup.findAll("div", {"class": "st"})
    for st_div in st_divs:
        news_summaries.append(st_div.text)
    return news_summaries
l = scrape_news_summaries("T-Notes")
#l = scrape_news_summaries("""T-Notes""")
for n in l:
    print(n)
See https://www.scrapehero.com/how-to-prevent-getting-blacklisted-while-scraping/ for a list of status codes that indicate you have been banned.
Using time.sleep(randint(0, 2)) is not the most reliable way to bypass blocks 👀
There are several steps to bypass blocking:
Make sure you pass a user-agent in the request headers to act like a "real" user visit, because the default requests user-agent is python-requests and websites understand that it's most likely a script sending the request. Check what your user-agent is.
Using a user-agent makes requests more reliable (but only up to a certain point).
Having one user-agent is not enough, but you can rotate user-agents to make it a bit more reliable.
Sometimes passing only a user-agent isn't enough; you can pass additional headers. See more HTTP request headers that you can send while making a request.
The most reliable way to bypass blocking is residential proxies. Residential proxies allow you to choose a specific location (country, city, or mobile carrier) and surf the web as a real user in that area. Proxies can be described as intermediaries that protect users from general web traffic; they act as buffers while also concealing your IP address.
Using non-overused proxies is the best option. You can scrape a lot of public proxies and save them to a list, or save them to a .txt file to save memory, and iterate over them while making requests to see what the results are, then move to a different type of proxy if the result is not what you were looking for (a minimal sketch follows below).
You can also get whitelisted. Getting whitelisted means having your IP addresses added to allow lists on a website, which explicitly allow some identified entities to access a particular privilege, i.e. a list of things allowed when everything is denied by default. One way to become whitelisted is to regularly do something useful for "them" based on the scraped data, which could lead to some insights.
For more information on how to bypass blocking, you can read the Reducing the chance of being blocked while web scraping blog post.
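As mentioned in the proxy step above, a minimal sketch of routing the same Google News request through a proxy looks like this (the proxy URL and credentials are placeholders, not a real endpoint):

import requests

# Placeholder proxy endpoint -- substitute a proxy you actually control or rent.
proxies = {
    "http": "http://user:pass@proxy.example.com:8080",
    "https": "http://user:pass@proxy.example.com:8080",
}

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36"}

html = requests.get("http://www.google.co.uk/search",
                    params={"q": "T-Notes", "tbm": "nws"},
                    headers=headers, proxies=proxies, timeout=30)
print(html.status_code)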
You can also check the response with status_code. If a bad request was made (a 4XX client error or 5XX server error response), it can be raised with Response.raise_for_status(). But if the status code for the request is 200 and we call raise_for_status(), we get None, which means there are no errors and everything is fine.
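For example, a short sketch of that check (same request as above, without a proxy):

import requests

response = requests.get("http://www.google.co.uk/search",
                        params={"q": "T-Notes", "tbm": "nws"},
                        headers={"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36"},
                        timeout=30)
print(response.status_code)         # e.g. 200, 429, 503 ...
print(response.raise_for_status())  # prints None on 200; raises requests.exceptions.HTTPError on 4XX/5XX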
Code and full example in online IDE:
from bs4 import BeautifulSoup
from random import randint
import requests, time, json, lxml
def scrape_news_summaries(query):
    # https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
    params = {
        "q": query,
        "hl": "en-US",  # language
        "gl": "US",     # country of the search, US -> USA
        "tbm": "nws",   # google news
    }

    # https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
    }

    time.sleep(randint(0, 2))  # relax and don't let google be angry

    html = requests.get("http://www.google.co.uk/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, "lxml")

    news_summaries = []

    if html.status_code == 200:
        for result in soup.select(".WlydOe"):
            source = result.select_one(".NUnG9d span").text
            title = result.select_one(".mCBkyc").text
            link = result["href"]
            snippet = result.select_one(".GI74Re").text
            date = result.select_one(".ZE0LJd span").text

            news_summaries.append({
                "source": source,
                "title": title,
                "link": link,
                "snippet": snippet,
                "date": date
            })

    return news_summaries
print(json.dumps(scrape_news_summaries("T-Notes"), indent=2, ensure_ascii=False))
Output:
[
{
"source": "Barchart.com",
"title": "U.S. Stocks Undercut As T-Note Yield Rises",
"link": "https://www.barchart.com/story/news/9572904/u-s-stocks-undercut-as-t-note-yield-rises",
"snippet": "T-note prices are seeing some supply pressure ahead of today's Treasury \nsale of $42 billion of 3-year T-notes. The Treasury will then sell $35...",
"date": "2 days ago"
},
{
"source": "Barchart.com",
"title": "U.S. Stocks Rally As T-Note Yield Eases",
"link": "https://www.barchart.com/story/news/9548700/u-s-stocks-rally-as-t-note-yield-eases",
"snippet": "Stocks are seeing support from strong overseas stock markets and today's \n-5.3 bp decline in the 10-year T-note yield to 2.774%.",
"date": "2 days ago"
},
{
"source": "PC Gamer",
"title": "The internet won't stop roasting this new Forspoken trailer",
"link": "https://www.pcgamer.com/the-internet-wont-stop-roasting-this-new-forspoken-trailer/",
"snippet": "He covers all aspects of the industry, from new game announcements and \npatch notes to legal disputes, Twitch beefs, esports, and Henry Cavill.",
"date": "14 hours ago"
},
{
"source": "ESPN",
"title": "Fantasy football daily notes - Backfield rumors in New England, Miami",
"link": "https://www.espn.com/fantasy/football/story/_/id/34383436/fantasy-football-daily-notes-backfield-rumors-new-england-miami",
"snippet": "Right now, the second-year receiver is a nice value in fantasy football \ndrafts, the WR37 in our draft trends, which isn't bad for a player...",
"date": "1 hour ago"
},
{
"source": "Cincinnati Bengals",
"title": "Bengals Training Camp Notes: Dax Hill, Jackson Carman, Joe ...",
"link": "https://www.bengals.com/news/jackson-carman-dax-hill-get-starts-in-preseason-opener",
"snippet": "Jackson Carman gets Friday's start at left guard. Bengals head coach Zac \nTaylor won't play most of his starters in Friday's (7:...",
"date": "20 hours ago"
},
{
"source": "Hoops Rumors",
"title": "Texas Notes: Wood, Mavericks, Martin, T. Jones",
"link": "https://www.hoopsrumors.com/2022/08/texas-notes-wood-mavericks-martin-t-jones.html",
"snippet": "Texas Notes: Wood, Mavericks, Martin, T. Jones. August 6th 2022 at 10:59pm \nCST by Arthur Hill. Christian Wood told WFAA TV that he's “counting my \nblessings”...",
"date": "4 days ago"
},
{
"source": "Yahoo! Sports",
"title": "Instant View: US CPI unchanged in July, raises hopes of Fed slowing",
"link": "https://sports.yahoo.com/instant-view-us-cpi-unchanged-125340947.html",
"snippet": "BONDS: The yield on 10-year Treasury notes was down 5.6 basis points to \n2.741%; The two-year U.S. Treasury yield, was down 16.3 basis points...",
"date": "1 day ago"
},
{
"source": "NFL Trade Rumors",
"title": "NFC Notes: Bears, Lions, Packers, Vikings - NFLTradeRumors ...",
"link": "https://nfltraderumors.co/nfc-notes-bears-lions-packers-vikings-129/",
"snippet": "Regarding Bears LB Roquan Smith requesting a trade, DE Robert Quinn \nbelieves that Smith is deserving of a new contract: “You don't get a lot...",
"date": "14 hours ago"
},
{
"source": "ESPN",
"title": "Fantasy football daily notes - Geno Smith, Albert Okwuegbunam trending up",
"link": "https://www.espn.com/fantasy/football/story/_/id/34378640/fantasy-football-daily-notes-geno-smith-albert-okwuegbunam-trending-up",
"snippet": "Read ESPN's fantasy football daily notes every weekday to stay caught ... \nDon't overlook him in deep or tight end premium fantasy formats.",
"date": "1 day ago"
},
{
"source": "Hoops Rumors",
"title": "Atlantic Notes: Quickley, Durant, Sixers, Raptors, R. Williams",
"link": "https://www.hoopsrumors.com/2022/08/atlantic-notes-quickley-durant-sixers-raptors-r-williams.html",
"snippet": "However, Morant said he and the former Kentucky standout aren't paying \nattention to that trade speculation as they attempt to hone...",
"date": "41 mins ago"
}
]
If you don't want to figure out how to use proxies, user-agent rotation, captcha solving, and so on, there are already-made APIs to make it work, for example the Google News Result API from SerpApi:
from serpapi import GoogleSearch
import json, os

def serpapi_code(query):
    params = {
        # https://docs.python.org/3/library/os.html#os.getenv
        "api_key": os.getenv("API_KEY"),  # your serpapi api key
        "engine": "google",               # search engine
        "q": query,                       # search query
        "tbm": "nws",                     # google news
        "location": "Dallas",             # your location
        # other parameters
    }

    search = GoogleSearch(params)  # where data extraction happens on the SerpApi backend
    results = search.get_dict()    # JSON -> Python dict

    news_summaries = []

    for result in results["news_results"]:
        news_summaries.append({
            "source": result["source"],
            "title": result["title"],
            "link": result["link"],
            "snippet": result["snippet"],
            "date": result["date"]
        })

    return news_summaries
print(json.dumps(serpapi_code("T-Notes"), indent=2, ensure_ascii=False))
The output will be the same.
If you need a more detailed explanation about scraping Google News, have a look at the Web Scraping Google News with Python blog post, or you can watch the video.
Disclaimer, I work for SerpApi.