I'm a newbie with python.
In PyCharm I wrote this code:
import requests
from bs4 import BeautifulSoup
response = requests.get(f"https://www.google.com/search?q=fitness+wear")
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
Instead of getting the HTML of the search results, what I get is the HTML of the following page.
I use the same code within a script on pythonanywhere.com and it works perfectly. I've tried lots of the solutions I found but the result is always the same, so now I'm stuck with it.
I think this should work:
import requests
from bs4 import BeautifulSoup

with requests.Session() as s:
    url = "https://www.google.com/search?q=fitness+wear"
    headers = {
        "referer": "https://www.google.com/",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
    }
    s.post(url, headers=headers)
    response = s.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup)
It uses a requests session and a POST request to create any initial cookies (I'm not fully sure about this part), which then allows you to scrape.
If you open a private window in your browser and go to google.com, you should see the same pop-up prompting you to give your consent. This is because you aren't sending any session cookies.
You have different options to tackle this.
One would be sending the cookies you can observe on the website with the request directly like so:
import requests
cookies = {"CONSENT":"YES+shp.gws-20210330-0-RC1.de+FX+412", ...}
resp = requests.get("https://www.google.com/search?q=fitness+wear", cookies=cookies)
The solution @Dimitriy Kruglikov uses is a lot cleaner, though, and sessions are a good way of keeping a persistent connection with the website.
Google doesn't block you, and you can still extract data from the HTML.
Using cookies isn't very convenient, and using a session with both POST and GET requests leads to more traffic.
You can remove this popup using either the decompose() or the extract() BS4 method:
annoying_popup.decompose() will completely destroy it and its contents. Documentation.
annoying_popup.extract() will detach it, leaving two HTML trees: one rooted at the BeautifulSoup object you used to parse the document, and one rooted at the tag that was extracted. Documentation.
After that, you can scrape everything you need, just as if the popup had never been there.
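For example, here is a minimal sketch; the consent-bump id is invented for illustration, so inspect the real page to find the element that actually wraps the popup:

from bs4 import BeautifulSoup

# "consent-bump" is a made-up id for this sketch; use the real popup's tag/id.
html = '<div id="consent-bump">Before you continue ...</div><div id="search">results ...</div>'
soup = BeautifulSoup(html, "html.parser")

popup = soup.find("div", id="consent-bump")
if popup:
    popup.decompose()  # removes the tag and destroys its contents
    # popup.extract() would instead detach it as its own, still usable, tree

print(soup)  # only the search results div remains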
See this Organic Results extraction I did recently. It scrapes title, summary, and link from Google Search Results.
Alternatively, you can use Google Search Engine Results API from SerpApi. Check out the Playground.
Code and example in online IDE:
from serpapi import GoogleSearch
import os

params = {
    "engine": "google",
    "q": "fus ro dah",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    print(f"Title: {result['title']}\nSnippet: {result['snippet']}\nLink: {result['link']}\n")
Output:
Title: Skyrim - FUS RO DAH (Dovahkiin) HD - YouTube
Snippet: I looked around for a fan made track that included Fus Ro Dah, but the ones that I found were pretty bad - some ...
Link: https://www.youtube.com/watch?v=JblD-FN3tgs
Title: Unrelenting Force (Skyrim) | Elder Scrolls | Fandom
Snippet: If the general subtitles are turned on, it can be seen that the text for the Draugr's Unrelenting Force is misspelled: "Fus Rah Do" instead of the proper "Fus Ro Dah." ...
Link: https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
Title: Fus Ro Dah | Know Your Meme
Snippet: Origin. "Fus Ro Dah" are the words for the "unrelenting force" thu'um shout in the game Elder Scrolls V: Skyrim. After reaching the first town of ...
Link: https://knowyourmeme.com/memes/fus-ro-dah
Title: Fus ro dah - Urban Dictionary
Snippet: 1. A dragon shout used in The Elder Scrolls V: Skyrim. 2.An international term for oral sex given by a female. ex.1. The Dragonborn yelled "Fus ...
Link: https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
Part of JSON:
"organic_results": [
{
"position": 1,
"title": "Unrelenting Force (Skyrim) | Elder Scrolls | Fandom",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)",
"displayed_link": "https://elderscrolls.fandom.com › wiki › Unrelenting_F...",
"snippet": "If the general subtitles are turned on, it can be seen that the text for the Draugr's Unrelenting Force is misspelled: \"Fus Rah Do\" instead of the proper \"Fus Ro Dah.\" ...",
"sitelinks": {
"inline": [
{
"title": "Location",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Location"
},
{
"title": "Effect",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Effect"
},
{
"title": "Usage",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Usage"
},
{
"title": "Word Wall",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Word_Wall"
}
]
},
"cached_page_link": "https://webcache.googleusercontent.com/search?q=cache:K3LEBjvPps0J:https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)+&cd=17&hl=en&ct=clnk&gl=us"
}
]
Disclaimer, I work for SerpApi.
I am completely new to Web Scraping using Python BeautifulSoup. Using SO and other blogs I have tried to build my first piece of code.
Objective: Given a string as input, the code will search https://www.google.com and get the search results with the following information:
Title
Brief Description
Link
Say I want to search "Core Banking Solution by Accenture". To do so:
import requests
from bs4 import BeautifulSoup as bs

search_str = 'Core Banking Solution Accenture'
url = 'https://www.google.com/search?q=' + search_str
page = requests.get(url)
soup = bs(page.content, 'lxml')

for node in soup.find_all('a'):
    print("Inner Text : {}".format(node.text))

for h in soup.find_all('h3'):
    print("Title : {}".format(h.text))
    print("Link :{}".format(node.get('href')))
I am getting an error for title.
AttributeError: 'NoneType' object has no attribute 'text'
Clearly, it is not getting any title object from soup.find_all('a').
Question: What trick do I have to apply here to get the title? Does soup.find_all('a') really contain any title tag? What am I missing here?
Update: Based on the suggestions received, I have updated the code. It is working now, although I still need to check the results in detail.
In this case, node.get("title") returns None, as there is no title attribute in the a tag. This causes an error when accessing the .text attribute. Instead, we may examine the HTML to figure out which part of the tag constitutes the title. See the example below.
from bs4 import BeautifulSoup
html = '''
<a href="https://html.com/attributes/a-title">
<br>
<h3>When To Use A Title In HTML</h3>
<div>
<cite role="text">https://html.com
<span role="text"> › attributes › a-title</span>
</cite>
</div>
</a>
'''
soup = BeautifulSoup(html, 'html.parser')
a_tags = soup.find_all('a')
for tag in a_tags:
    print(tag.get("title"))  # prints None
    print(tag.h3.text)  # prints "When To Use A Title In HTML"
In this search result, the title is found in the contents of the h3-tag. We can extract it using tag.h3.text. The browser devtools are a great resource when trying to figure out how to navigate the tags.
You might receive empty output because you're not sending a user-agent to make the request look like it comes from a "real" user.
If the request is blocked, the response contains different HTML with different elements/selectors and some sort of error message, which is why you might get AttributeError: 'NoneType' object has no attribute 'text'.
Also, when no user-agent is specified in the request headers, the requests library defaults to python-requests, which websites can recognize and may block, because it signals that a script is sending the request.
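As a quick illustration, here is a minimal sketch of guarding against a missing element instead of calling .text directly; the blocked-page HTML is made up, and .DKV0Md is the title selector from the example below:

from bs4 import BeautifulSoup

# A blocked response typically lacks the selectors a normal results page has.
html = "<html><body>Our systems have detected unusual traffic ...</body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.select_one(".DKV0Md")  # returns None on a block page
title = title_tag.text if title_tag else None  # no AttributeError this way
print(title)  # -> None, signalling a block/consent page instead of crashing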
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "Core Banking Solution Accenture",  # search query
    "gl": "us",  # country of the search
    "hl": "en"   # language
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.79 Safari/537.36",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

results = []

for index, result in enumerate(soup.select(".tF2Cxc"), start=1):
    title = result.select_one(".DKV0Md").text
    link = result.select_one(".yuRUbf a")["href"]
    try:
        snippet = result.select_one("#rso .lyLwlc").text
    except:
        snippet = None

    results.append({
        "position": index,
        "title": title,
        "link": link,
        "snippet": snippet
    })

print(json.dumps(results, indent=2, ensure_ascii=False))
Part of the output:
[
  {
    "position": 1,
    "title": "Core Banking Services | Accenture",
    "link": "https://www.accenture.com/us-en/services/banking/core-banking",
    "snippet": "Accenture brings together the skills, technologies and capabilities to renew core banking systems in ways that put customers first and enable banks to release ..."
  },
  {
    "position": 2,
    "title": "Accenture In Consortium Hired to Implement Core Banking ...",
    "link": "https://newsroom.accenture.com/industries/banking/accenture-in-consortium-hired-to-implement-core-banking-solution-at-polands-largest-bank.htm",
    "snippet": "WARSAW; Aug. 18, 2003 –Accenture has signed a $114 million contract to implement a core-banking platform at PKO BP, Poland's largest bank, as part of a ..."
  }, ... other results
]
Alternatively, you can achieve it by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to create the parser from scratch, maintain it, or figure out how to bypass blocks from Google or other search engines and how to scale it.
Code to integrate:
from serpapi import GoogleSearch
import json, os

params = {
    "api_key": "serpapi_key",               # your serpapi api key
    "engine": "google",                     # search engine
    "q": "Core Banking Solution Accenture", # search query
    "google_domain": "google.com",          # google domain
    "gl": "us",                             # country to search from
    "hl": "en"                              # language
    # other parameters
}

search = GoogleSearch(params)  # where data extraction happens
results = search.get_dict()    # JSON -> Python dictionary

data = []

for result in results["organic_results"]:
    data.append({
        "position": result.get("position"),
        "title": result.get("title"),
        "link": result.get("link"),
        "snippet": result.get("snippet")
    })

print(json.dumps(data, indent=2, ensure_ascii=False))
Part of the output:
[
  {
    "position": 1,
    "title": "Core Banking Services | Accenture",
    "link": "https://www.accenture.com/us-en/services/banking/core-banking",
    "snippet": "Accenture brings together the skills, technologies and capabilities to renew core banking systems in ways that put customers first and enable banks to release ..."
  },
  {
    "position": 2,
    "title": "Accenture In Consortium Hired to Implement Core Banking ...",
    "link": "https://newsroom.accenture.com/industries/banking/accenture-in-consortium-hired-to-implement-core-banking-solution-at-polands-largest-bank.htm",
    "snippet": "WARSAW; Aug. 18, 2003 –Accenture has signed a $114 million contract to implement a core-banking platform at PKO BP, Poland's largest bank, as part of a ..."
  }, ... other results
]
Disclaimer, I work for SerpApi.
I am trying to scrape the title of every item on an eBay page. This is the page. I first tried to scrape the title of the first listing (lines 5-7 of my code), and I was successful: the title of the first listing gets printed. But when I try to scrape every single title on the eBay page (lines 8-10), nothing gets printed. Is there a flaw in my logic? Thanks!
1. from bs4 import BeautifulSoup
2. import requests
3. source = requests.get("https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=hippo&_sacat=0").text
4. soup = BeautifulSoup(source, "lxml")
5. listing = soup.find("li", class_=("s-item s-item--watch-at-corner"))
6. title = soup.find("h3", class_=("s-item__title")).text
7. print(title)
8. for listing in soup.find_all("li", class_=("s-item s-item--watch-at-corner")):
9.     title = soup.find("h3", class_=("s-item__title")).text
10.    print(title)
After a quick glance at the docs:
BeautifulSoup's .find_all() method returns a list (as one would expect). However, it seems to me that the .find() in your for loop is just querying the response again, rather than doing something with the list you're generating. I would instead extract the titles from each listing, such as:
title = listing['some_property']
or perhaps there's another method provided by the library you're using.
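For instance, a minimal sketch of that idea, reusing the URL and class names from the question; the key change is calling find() on each listing instead of on the whole soup:

from bs4 import BeautifulSoup
import requests

source = requests.get("https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=hippo&_sacat=0").text
soup = BeautifulSoup(source, "lxml")

for listing in soup.find_all("li", class_="s-item s-item--watch-at-corner"):
    title_tag = listing.find("h3", class_="s-item__title")  # search inside this listing only
    if title_tag:  # skip listings without a title tag
        print(title_tag.text)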
Looking at your code, you haven't checked the type of the element you searched for by class.
from bs4 import BeautifulSoup
import requests
source = requests.get("https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=hippo&_sacat=0").text
soup = BeautifulSoup(source, "lxml")
listing = soup.find("li", class_=("s-item s-item--watch-at-corner"))
title = soup.find("h3", class_=("s-item__title")).text
print(type(listing))
This returns the result of
<class 'NoneType'>
So the parsing stops there, as there are no such li tags to find.
You're calling find("h3", class_=("s-item__title")) on the soup every time; you need to call it on each listing in the loop, or it will always fetch the first title. Also, keep in mind there were a couple of hidden results on the eBay page for whatever reason, so maybe check that out and decide whether you want to ignore or include those as well. I added the enumerate function in the loop just to keep track of the number of the results.
I used this selector in the Chrome dev tools to find all the listings: li.s-item.s-item--watch-at-corner h3.s-item__title
from bs4 import BeautifulSoup
import requests
source = requests.get("https://www.ebay.com/sch/i.html?_from=R40&_trksid=p2380057.m570.l1313&_nkw=hippo&_sacat=0").text
soup = BeautifulSoup(source, "lxml")
listing = soup.find("li", class_=("s-item s-item--watch-at-corner"))
title = soup.find("h3", class_=("s-item__title")).text
print(title)
for i, listing in enumerate(soup.find_all("li", class_=("s-item s-item--watch-at-corner"))):
    title = listing.find("h3", class_=("s-item__title")).text
    print("[{}] ".format(i) + title)
Result:
[0] Pewter Hippopotamus Hippo Figurine
[1] Hippopotamus Figurine 1.5" Gemstone Opalite Crystal Healing Carved Statue Decor
[2] hippopotamus coffee cafe picture animal hippo art tile gift
[3] NEW! Miniature Bronze Hippo Figurine Miniature Bronze Statue Animal Collectible
[4] Hippopotamus Gzhel porcelain figurine hippo handmade
[5] Hippopotamus Gzhel porcelain figurine hippo souvenir handmade and hand-painted
....
Have a look at the SelectorGadget Chrome extension, which lets you pick selectors by clicking on the desired element in your browser. It doesn't always work perfectly if the page heavily uses JS (in this case we can use it).
There is also the possibility of the request being blocked when using requests, since the default user-agent in the requests library is python-requests.
An additional step could be to rotate the user-agent, for example to switch between PC, mobile, and tablet, as well as between browsers, e.g. Chrome, Firefox, Safari, Edge and so on; a minimal sketch follows.
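Here is a minimal sketch of such a rotation; the user-agent strings are illustrative examples only (desktop Chrome, iPhone Safari, desktop Safari):

import random
import requests

user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
    "Mozilla/5.0 (iPhone; CPU iPhone OS 15_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/15.6 Mobile/15E148 Safari/604.1",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/16.0 Safari/605.1.15",
]

headers = {"User-Agent": random.choice(user_agents)}  # pick a different one per request
page = requests.get("https://www.ebay.com/sch/i.html", params={"_nkw": "hippo"}, headers=headers, timeout=30)
print(page.status_code)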
Check out the code in the online IDE
from bs4 import BeautifulSoup
import requests, json, lxml

# https://requests.readthedocs.io/en/latest/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36",
}

params = {
    '_nkw': 'hippo',  # search query
}

data = []

page = requests.get('https://www.ebay.com/sch/i.html', params=params, headers=headers, timeout=30)
soup = BeautifulSoup(page.text, 'lxml')

for title in soup.select(".s-item__title span"):
    if "Shop on eBay" in title:
        pass
    else:
        data.append({"title": title.text})

print(json.dumps(data, indent=2, ensure_ascii=False))
Example output:
[
  {
    "title": "Hippopotamus Gzhel porcelain figurine hippo souvenir handmade and hand-painted"
  },
  {
    "title": "Coad Peru Signed 1 1/2\" HIPPO Clay Pottery Collectible Figurine"
  },
  {
    "title": "Glass Hippo Hippopotamus figurine \"murano\" handmade"
  },
  {
    "title": "2 Hand Carved Soapstone Brown Hippopotamus Hippo Animal Figurine Paperweight"
  },
  {
    "title": "Schleich Hippo D-73527 Hippopotamus Mouth Open Wildlife Toy Figure 2012"
  },
  # ...
]
As an alternative, you can use Ebay Organic Results API from SerpApi. It's a paid API with a free plan that handles blocks and parsing on their backend.
Example code:
from serpapi import EbaySearch
import os, json

params = {
    "api_key": os.getenv("API_KEY"),  # serpapi api key
    "engine": "ebay",                 # search engine
    "ebay_domain": "ebay.com",        # ebay domain
    "_nkw": "hippo"                   # search query
}

search = EbaySearch(params)  # where data extraction happens

data = []

results = search.get_dict()  # JSON -> Python dict

for organic_result in results.get("organic_results", []):
    title = organic_result.get("title")
    data.append({
        "title": title
    })

print(json.dumps(data, indent=2))
Output:
[
  {
    "title": "Schleich Hippo D-73527 Hippopotamus Mouth Open Wildlife Toy Figure 2012"
  },
  {
    "title": "Vintage WOODEN SCULPTURE Hippo HIPPOPOTAMUS"
  },
  {
    "title": "Hippopotamus Gzhel porcelain figurine hippo souvenir handmade and hand-painted"
  },
  {
    "title": "Glass Hippo Hippopotamus figurine \"murano\" handmade"
  },
  # ...
]
I am trying to scrape the Google knowledge panel to retrieve the name of drugs if they do not appear in the Google search. For instance, if I look for "Buscopan" in Google, the resulting webpage looks like this:
Now, what I am trying to do with the code shown is take the term "Scopolamina-N-butilbromuro" from the knowledge panel, but I am actually unable to retrieve it in the HTML code once I inspect the element. To be precise, the code I am implementing, together with the error message, is as follows:
import requests
from bs4 import BeautifulSoup
# URL
url = "https://www.google.com/search?client=safari&rls=en&q="+"buscopan"+"&ie=UTF-8&oe=UTF-8"
# Sending HTTP request
req = requests.get(url)
# Pulling HTTP data from internet
sor = BeautifulSoup(req.text, "html.parser")
temp = sor.find("h2", class_= "qrShPb kno-ecr-pt PZPZlf mfMhoc hNKfZe").text
print(temp)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-39-ef5599a1a1fc> in <module>
13 # Finding temperature in Celsius
14 #temp = sor.find("h2", class_='qrShPb').text
---> 15 temp = sor.find("h2", class_= "qrShPb kno-ecr-pt PZPZlf mfMhoc hNKfZe").text
16
17
AttributeError: 'NoneType' object has no attribute 'text'
I don't know what I am doing wrong. I think the bit of html code I need to look at is the following:
<h2 class="qrShPb kno-ecr-pt PZPZlf mfMhoc hNKfZe" data-local-attribute="d3bn" data-attrid="title" data-ved="2ahUKEwjujfLcgO7rAhWKjosKHSiBAFEQ3B0oATASegQIEBAL"></h2>
Of course, the rest of the HTML code is in the picture reported above, but if you need a bigger version, please do not hesitate to ask!
Any suggestion?
Thank you,
Federico
To get the correct result page from Google search, specify the User-Agent HTTP header. For example:
import requests
from bs4 import BeautifulSoup
params = {
    'q': 'buscopan',  # <-- change to your keyword
    'hl': 'it'        # <-- change to `en` for english results
}
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
url = 'https://www.google.com/search'
soup = BeautifulSoup(requests.get(url, params=params, headers=headers).content, 'html.parser')
print(soup.select_one('h2[data-attrid="title"]').text)
Prints:
Scopolamina-N-butilbromuro
Alternatively to Andrej Kesely's solution, you can use the third-party Google Knowledge Graph API from SerpApi. It's a paid API with a free plan. Check out the Playground to test.
Code to integrate and full example in the online IDE:
from serpapi import GoogleSearch
import os
params = {
    "q": "Buscopan",
    "google_domain": "google.com",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
title = results['knowledge_graph']['title']
print(title)
Output:
Butylscopolamine
Part of JSON Knowledge Graph output:
"knowledge_graph": {
"title": "Butylscopolamine",
"type": "Medication",
"description": "Hyoscine butylbromide, also known as scopolamine butylbromide and sold under the brandname Buscopan among others, is an anticholinergic medication used to treat crampy abdominal pain, esophageal spasms, renal colic, and bladder spasms. It is also used to improve respiratory secretions at the end of life.",
"source": {
"name": "Wikipedia",
"link": "https://en.wikipedia.org/wiki/Hyoscine_butylbromide"
},
"formula": "C₂₁H₃₀BrNO₄",
"molar_mass": "440.371 g/mol",
"chem_spider_id": "16736107",
"trade_name": "Buscopan, others",
"pub_chem_cid": "6852391",
"ch_ebi_id": "32123",
"people_also_search_for": "Scopolamine, Metamizole, MORE"
}
Disclaimer, I work for SerpApi.
I am trying to scrape google search results using the following code. I want to take the title and the url of the first page of the results and then continue by scraping the next pages of the search results too.
This is a sample of code that I just started writing:
from urllib.request import urlopen as uReq
import urllib.request
from bs4 import BeautifulSoup as soup
paging_url = "https://www.google.gr/search?q=donald+trump&ei=F91FW8XBGYjJsQHQwaWADA&start=110&sa=N&biw=811&bih=662"
req = urllib.request.Request("https://www.google.gr/search?q=donald+trump&ei=F91FW8XBGYjJsQHQwaWADA&start=110&sa=N&biw=811&bih=662", headers={'User-Agent': "Magic Browser"})
UClient = uReq(req) # downloading the url
page_html = UClient.read()
UClient.close()
page_soup = soup(page_html, "html.parser")
I noticed that all google results have a common class named "g". So I wrote the following command:
results= page_soup.findAll("div",{"class":"g"})
But after testing, the results obtained are not the same as those I see when I visit the original URL.
Moreover some div tags such as:
<div data-hveid="38" data-ved="0ahUKEwjGp7XEj5fcAhXMDZoKHRf8DJMQFQgmKAAwAA">
and
<div class="rc">
cannot be seen in the tree that BeautifulSoup produces, meaning I cannot use the findAll function to locate objects inside those tags because BeautifulSoup acts as if they do not exist.
Why does all this happen?
I would never scrape Google directly via raw HTTP requests; Google can detect it very easily. To avoid detection, I suggest using an automated browser like Chrome with selenium.
In your example, the problem is that Google delivers a different HTML version of its SERP page because it detects the low-level HTTP scraping.
There are open-source libraries that handle all the difficult parts of scraping, for example GoogleScraper, a tool written in Python 3 that supports three different scraping modes: raw HTTP scraping, selenium mode (with real browsers), and asynchronous HTTP mode.
To paginate through all Google Search pages, you need a while True loop that scrapes the first page and then each subsequent one.
Pagination runs while the next button exists (determined by the presence of its CSS selector on the page, in our case .d6cvqb a[id=pnnext]). You need to increase the value of the ["start"] parameter by 10 to access the next page if it is present; otherwise, break out of the while loop:
if soup.select_one('.d6cvqb a[id=pnnext]'):
    params["start"] += 10
else:
    break
Make sure you're sending a user-agent in the request headers to act as a "real" user visit, because the default requests user-agent is python-requests and websites can tell that it's most likely a script sending the request. Check what your user-agent is.
Check the code using pagination in the online IDE.
from bs4 import BeautifulSoup
import requests, json, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "donald trump",  # query
    "hl": "en",           # language
    "gl": "us",           # country of the search, US -> USA
    "start": 0,           # number page by default up to 0
    # "filter": 0         # shows more than 10 pages. By default up to ~10-13 if filter = 1.
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36"
}

page_num = 0
website_data = []

while True:
    page_num += 1
    print(f"page: {page_num}")

    html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
    soup = BeautifulSoup(html.text, 'lxml')

    for result in soup.select(".tF2Cxc"):
        title = f'Title: {result.select_one("h3").text}'
        link = f'Link: {result.select_one("a")["href"]}'

        website_data.append({
            "title": title,
            "link": link,
        })

    if soup.select_one('.d6cvqb a[id=pnnext]'):
        params["start"] += 10
    else:
        break

print(json.dumps(website_data, indent=2, ensure_ascii=False))
Example output:
[
  {
    "title": "Title: Donald J. Trump: Home",
    "link": "Link: https://www.donaldjtrump.com/"
  },
  {
    "title": "Title: Donald Trump - Wikipedia",
    "link": "Link: https://en.wikipedia.org/wiki/Donald_Trump"
  },
  {
    "title": "Title: Donald J. Trump | The White House",
    "link": "Link: https://www.whitehouse.gov/about-the-white-house/presidents/donald-j-trump/"
  },
  {
    "title": "Title: Donald Trump - CNBC",
    "link": "Link: https://www.cnbc.com/donald-trump/"
  },
  {
    "title": "Title: Donald Trump | The Hill | Page 1",
    "link": "Link: https://thehill.com/people/donald-trump/"
  },
  # ...
]
Alternatively, you can use Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that it will bypass blocks (including CAPTCHA) from Google, and there is no need to create the parser and maintain it.
Code example:
from serpapi import GoogleSearch
from urllib.parse import urlsplit, parse_qsl
import json, os

params = {
    "api_key": os.getenv("API_KEY"),  # serpapi key
    "engine": "google",               # serpapi parser engine
    "q": "donald trump",              # search query
    "num": "100"                      # number of results per page (100 per page in this case)
    # other search parameters: https://serpapi.com/search-api#api-parameters
}

search = GoogleSearch(params)  # where data extraction happens

organic_results_data = []
page_num = 0

while True:
    results = search.get_dict()  # JSON -> Python dictionary

    page_num += 1

    for result in results["organic_results"]:
        organic_results_data.append({
            "page_num": page_num,
            "title": result.get("title"),
            "link": result.get("link"),
            "displayed_link": result.get("displayed_link"),
        })

    if "next_link" in results.get("serpapi_pagination", []):
        search.params_dict.update(dict(parse_qsl(urlsplit(results.get("serpapi_pagination").get("next_link")).query)))
    else:
        break

print(json.dumps(organic_results_data, indent=2, ensure_ascii=False))
Output:
[
  {
    "page_num": 1,
    "title": "Donald J. Trump: Home",
    "link": "https://www.donaldjtrump.com/",
    "displayed_link": "https://www.donaldjtrump.com"
  },
  {
    "page_num": 1,
    "title": "Donald Trump - Wikipedia",
    "link": "https://en.wikipedia.org/wiki/Donald_Trump",
    "displayed_link": "https://en.wikipedia.org › wiki › Donald_Trump"
  },
  {
    "page_num": 1,
    "title": "Donald J. Trump | The White House",
    "link": "https://www.whitehouse.gov/about-the-white-house/presidents/donald-j-trump/",
    "displayed_link": "https://www.whitehouse.gov › ... › Presidents"
  },
  {
    "page_num": 1,
    "title": "Donald Trump - CNBC",
    "link": "https://www.cnbc.com/donald-trump/",
    "displayed_link": "https://www.cnbc.com › donald-trump"
  },
  # ...
]
Disclaimer, I work for SerpApi.
Goal: I would like to verify, if a specific Google search has a suggested result on the right hand side and - in case of such a suggestion - scrape some information like company type / address / etc.
Approach: I wanted to use a Python scraper with Requests and BeautifulSoup4
import bs4
import requests
address='https://www.google.co.ve/?gws_rd=cr&ei=DgBqVpWJMoPA-gHy25fACg#q=caracas+arepa'
page = requests.get(address)
soup = bs4.BeautifulSoup(page.content,'html.parser')
print (soup.prettify())
Issue:
The requested page does not include the search results (I am not sure whether some variable on the Google page is set to invisible?), but rather only the header and footer of the Google page.
Questions:
Alternative ways to obtain the described information? Any ideas?
I once obtained results with the described method, but the respective address was constructed differently (I remember many numbers in the Google URL, but unfortunately I cannot reproduce the search address). Therefore: is there a requirement on the Google URL so that it can be scraped via requests.get?
The best way to get information from a service like Google Places will almost always be the official API. That said, if you're dead set on scraping, it's likely that what's returned by the HTTP request is meant for a browser to render. What BeautifulSoup does is not equivalent to rendering the data it receives, so it's very likely you're just getting useless empty containers that are later filled out dynamically.
I think your question is similar to google-search-with-python-reqeusts; maybe you could get some help from that.
And I agree with LiterallyElvis: the API is a better idea than crawling it directly.
Finally, if you want to use requests for this work, I recommend using PhantomJS and selenium to mimic how a browser works, as Google probably uses some AJAX techniques that make the page look different to a real browser than to a crawler.
As I am in a country from which Google is difficult to visit, I couldn't reproduce your problem directly; the above is what I could think of, and I hope it helps.
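To illustrate the browser route suggested above: PhantomJS is no longer maintained, so this minimal sketch uses headless Chrome with selenium instead (it assumes selenium and Chrome are installed); the rendered page_source can then be parsed with BeautifulSoup as usual:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from bs4 import BeautifulSoup

options = Options()
options.add_argument("--headless")  # run Chrome without opening a window

driver = webdriver.Chrome(options=options)
driver.get("https://www.google.com/search?q=caracas+arepa")
html = driver.page_source  # HTML after the browser has executed scripts
driver.quit()

soup = BeautifulSoup(html, "html.parser")
print(soup.title.text)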
You need to select_one() the element (container) that contains all the needed data and check whether it exists; if so, scrape the data.
Make sure you're using a user-agent to act as a "real" user visit, otherwise your request might be blocked or you might receive different HTML with different selectors. Check what your user-agent is.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml

params = {
    "q": "caracas arepa bar google",
    "gl": "us"
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
}

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

# if the right-side knowledge graph is present -> parse the data
if soup.select_one(".liYKde"):
    place_name = soup.select_one(".PZPZlf.q8U8x span").text
    place_type = soup.select_one(".YhemCb+ .YhemCb").text
    place_reviews = soup.select_one(".hqzQac span").text
    place_rating = soup.select_one(".Aq14fc").text
    print(place_name, place_type, place_reviews, place_rating, sep="\n")

# output:
'''
Caracas Arepa Bar
Venezuelan restaurant
1,123 Google reviews
4.5
'''
Alternatively, you can achieve the same thing using Google Knowledge Graph API from SerpApi. It's a paid API with a free plan.
The biggest difference is that you don't need to figure out how to parse the data, how to scale the number of requests, or how to bypass blocks from Google and other search engines.
from serpapi import GoogleSearch
import json

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "caracas arepa bar place",
    "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

print(json.dumps([results["knowledge_graph"]], indent=2))

# part of the output:
'''
[
  {
    "title": "Caracas Arepa Bar",
    "type": "Venezuelan restaurant",
    "place_id": "ChIJVcQ2ll9ZwokRwmkvsArPXyo",
    "website": "http://caracasarepabar.com/",
    "description": "Arepa specialist offering creative, low-priced renditions of the Venezuelan corn-flour staple.",
    "local_map": {
      "image": "https://www.google.com/maps/vt/data=TF2Rd51PtEnU2M3pkZHYHKdSwhMDJ_ZwRfg0vfwlDRAmv1u919sgFl8hs_lo832ziTWxCZM9BKECs6Af-TA1hh0NLjuYAzOLFA1-RBEmj-8poygymcRX2KLNVTGGZZKDerZrKW6fnkONAM4Ui-BVN8XwFrwigoqqxObPg8bqFIgeM3LPCg",
      "link": "https://www.google.com/maps/place/Caracas+Arepa+Bar/#40.7131972,-73.9574167,15z/data=!4m2!3m1!1s0x0:0x2a5fcf0ab02f69c2?sa=X&hl=en",
      "gps_coordinates": {
        "latitude": 40.7131972,
        "longitude": -73.9574167,
        "altitude": 15
      }
    } ... much more results including place images, popular times, user reviews.
  }
]
'''
Disclaimer: I work for SerpApi.