I am trying to scrape google knowledge panel to retrieve the name of drugs if they do not appear in google search. For instance if I look for "Buscopan" in Google, the appearing webpage looks like this:
Now, what I am trying to do with the code shown is take the term "Scopolamina-N-butilbromuro" in the knowledge panel but am actually unable to retrieve it in the html code once I inspect the element. To be precise. The code I am implementing together with the error message is as follows:
import requests
from bs4 import BeautifulSoup
URL
url = "https://www.google.com/search?client=safari&rls=en&q="+"buscopan"+"&ie=UTF-8&oe=UTF-8"
# Sending HTTP request
req = requests.get(url)
# Pulling HTTP data from internet
sor = BeautifulSoup(req.text, "html.parser")
temp = sor.find("h2", class_= "qrShPb kno-ecr-pt PZPZlf mfMhoc hNKfZe").text
print(temp)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-39-ef5599a1a1fc> in <module>
13 # Finding temperature in Celsius
14 #temp = sor.find("h2", class_='qrShPb').text
---> 15 temp = sor.find("h2", class_= "qrShPb kno-ecr-pt PZPZlf mfMhoc hNKfZe").text
16
17
AttributeError: 'NoneType' object has no attribute 'text'
I don't know what I am doing wrong. I think the bit of html code I need to look at is the following:
<h2 class="qrShPb kno-ecr-pt PZPZlf mfMhoc hNKfZe" data-local-attribute="d3bn" data-attrid="title" data-ved="2ahUKEwjujfLcgO7rAhWKjosKHSiBAFEQ3B0oATASegQIEBAL"></h2>
Of course the rest of the html code is in the picture reported, but if you need a bigger version, please, do not esitate!
Any suggestion?
Thank you,
Federico
To get correct result page from Google search, specify User-Agent HTTP header. For example:
import requests
from bs4 import BeautifulSoup
params = {
'q': 'buscopan', # <-- change to your keyword
'hl': 'it' # <-- change to `en` for english results
}
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
url = 'https://www.google.com/search'
soup = BeautifulSoup(requests.get(url, params=params, headers=headers).content, 'html.parser')
print(soup.select_one('h2[data-attrid="title"]').text)
Prints:
Scopolamina-N-butilbromuro
Alternatively, to Andrej Kesely solution, you can use third-party Google Knowledge Graph API from SerpApi. It's a paid API with a free plan. Check out the Playground to test.
Code to integrate and full example in the online IDE:
from serpapi import GoogleSearch
import os
params = {
"q": "Buscopan",
"google_domain": "google.com",
"hl": "en",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
title = results['knowledge_graph']['title']
print(title)
Output:
Butylscopolamine
Part of JSON Knowledge Graph output:
"knowledge_graph": {
"title": "Butylscopolamine",
"type": "Medication",
"description": "Hyoscine butylbromide, also known as scopolamine butylbromide and sold under the brandname Buscopan among others, is an anticholinergic medication used to treat crampy abdominal pain, esophageal spasms, renal colic, and bladder spasms. It is also used to improve respiratory secretions at the end of life.",
"source": {
"name": "Wikipedia",
"link": "https://en.wikipedia.org/wiki/Hyoscine_butylbromide"
},
"formula": "C₂₁H₃₀BrNO₄",
"molar_mass": "440.371 g/mol",
"chem_spider_id": "16736107",
"trade_name": "Buscopan, others",
"pub_chem_cid": "6852391",
"ch_ebi_id": "32123",
"people_also_search_for": "Scopolamine, Metamizole, MORE"
}
Disclaimer, I work for SerpApi.
Related
I am completely new to Web Scraping using Python BeautifulSoup. Using SO and other blogs I have tried to build my first piece of code.
Objective: Given a string as input, the code will search in https://www.google.com and get the search results with following information:
Title
Brief Description
Link
Say I want to search "Core Banking Solution by Accenture". To do so :
search_str ='Core Banking Solution Accenture'
url = 'https://www.google.com/search?q=' + search_str
page = requests.get(url)
soup = bs(page.content,'lxml')
for node in soup.find_all('a'):
print("Inner Text : {}".format(node.text))
for h in soup.find_all('h3'):
print("Title : {}".format(h.text)
print("Link :{}".format(node.get('href'))
I am getting an error for title.
AttributeError: 'NoneType' object has no attribute 'text'
Clearly, it is not getting any title object from soup.find_all('a').
Question: What trick I have to apply here to get the title? Does soup.find_all('a') really contains any title` tag? What I am missing here?
Update: Based on the suggestions received, I have updated code piece. Now it is working. Although need to check the results in details.
In this case, node.get("title") returns None, as there is no title-attribute in the a-tag. This causes an error when accessing the .text-attribute. Instead, we may examine the HTML to figure out which part of the tag constitutes the title. See the example below.
from bs4 import BeautifulSoup
html = '''
<a href="https://html.com/attributes/a-title">
<br>
<h3>When To Use A Title In HTML</h3>
<div>
<cite role="text">https://html.com
<span role="text"> › attributes › a-title</span>
</cite>
</div>
</a>
'''
soup = BeautifulSoup(html, 'html.parser')
a_tags = soup.find_all('a')
for tag in a_tags:
print(tag.get("title")) # prints None
print(tag.h3.text) # prints "When To Use A Title In HTML"
In this search result, the title is found in the contents of the h3-tag. We can extract it using tag.h3.text. The browser devtools are a great resource when trying to figure out how to navigate the tags.
You might receive an empty output because of not using user-agent to act as a "real" user request.
If the request is being blocked, the response would contain different HTML with different elements/selectors and some sort of an error and that's why you might get an AttributeError: 'NoneType' object has no attribute 'text' error.
Also, when not specifying user-agent in request headers, requests library defaults to python-requests which websites understand and might block because they understand that it's a script that sends a request.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, json, lxml
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
"q": "Core Banking Solution Accenture", # search query
"gl": "us", # country of the search
"hl": "en" # language
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.79 Safari/537.36",
}
html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
results = []
for index, result in enumerate(soup.select(".tF2Cxc"), start=1):
title = result.select_one(".DKV0Md").text
link = result.select_one(".yuRUbf a")["href"]
try:
snippet = result.select_one("#rso .lyLwlc").text
except: snippet = None
results.append({
"position": index,
"title": title,
"link": link,
"snippet": snippet
})
print(json.dumps(results, indent=2, ensure_ascii=False))
Part of the output:
[
{
"position": 1,
"title": "Core Banking Services | Accenture",
"link": "https://www.accenture.com/us-en/services/banking/core-banking",
"snippet": "Accenture brings together the skills, technologies and capabilities to renew core banking systems in ways that put customers first and enable banks to release ..."
},
{
"position": 2,
"title": "Accenture In Consortium Hired to Implement Core Banking ...",
"link": "https://newsroom.accenture.com/industries/banking/accenture-in-consortium-hired-to-implement-core-banking-solution-at-polands-largest-bank.htm",
"snippet": "WARSAW; Aug. 18, 2003 –Accenture has signed a $114 million contract to implement a core-banking platform at PKO BP, Poland's largest bank, as part of a ..."
}, ... other results
]
Alternatively, you can achieve it by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to create the parser from scratch, maintain it, figure out how to bypass blocks from Google or other search engines and how scale it.
Code to integrate:
from serpapi import GoogleSearch
import json, os
params = {
"api_key": "serpapi_key", # your serpapi api key
"engine": "google", # search engine
"q": "Core Banking Solution Accenture", # search query
"google_domain": "google.com", # google domain
"gl": "us", # country to search from
"hl": "en" # language
# other parameters
}
search = GoogleSearch(params) # where data extraction happens
results = search.get_dict() # JSON -> Python dictionary
data = []
for result in results["organic_results"]:
data.append({
"position": result.get("position"),
"title": result.get("title"),
"link": result.get("link"),
"snippet": result.get("snippet")
})
print(json.dumps(data, indent=2, ensure_ascii=False))
Part of the output:
[
{
"position": 1,
"title": "Core Banking Services | Accenture",
"link": "https://www.accenture.com/us-en/services/banking/core-banking",
"snippet": "Accenture brings together the skills, technologies and capabilities to renew core banking systems in ways that put customers first and enable banks to release ..."
},
{
"position": 2,
"title": "Accenture In Consortium Hired to Implement Core Banking ...",
"link": "https://newsroom.accenture.com/industries/banking/accenture-in-consortium-hired-to-implement-core-banking-solution-at-polands-largest-bank.htm",
"snippet": "WARSAW; Aug. 18, 2003 –Accenture has signed a $114 million contract to implement a core-banking platform at PKO BP, Poland's largest bank, as part of a ..."
}, ... other results
]
Disclaimer, I work for SerpApi.
I'm a newbie with python.
In PyCharm I wrote this code:
import requests
from bs4 import BeautifulSoup
response = requests.get(f"https://www.google.com/search?q=fitness+wear")
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
Instead getting the HTML of the search results, what I get is the HTML of the following page
I use the same code within a script on pythonanywhere.com and it works perfectly. I've tried lots of the solutions I found but the result is always the same, so now I'm stuck with it.
I think this should work:
import requests
from bs4 import BeautifulSoup
with requests.Session() as s:
url = f"https://www.google.com/search?q=fitness+wear"
headers = {
"referer":"referer: https://www.google.com/",
"user-agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
}
s.post(url, headers=headers)
response = s.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
It uses a request session and a post request to create any initial cookies (not fully sure on this) and then allows you scrape.
If you open up a private Window in your browser and go to google.com, you should see the same pop-up prompting you to give your consent. This is, because you don't have session cookies send.
You have different options to tackle this.
One would be sending the cookies you can observe on the website with the request directly like so:
import requests
cookies = {"CONSENT":"YES+shp.gws-20210330-0-RC1.de+FX+412", ...}
resp = request.get(f"https://www.google.com/search?q=fitness+wear",cookies=cookies)
The solution #Dimitriy Kruglikov uses is a lot cleaner though and using sessions is a good way of having a persistent Session with the website.
Google doesn't block you, you still can extract data from the HTML.
Using cookies isn't very convenient and using session with post and get requests will lead to a bigger amount of traffic.
You can remove this popup by either using decompose() or extract() BS4 methods:
annoying_popup.decompose() will completely destroy it and its contents. Documentation.
annoying_popup.extract() will make another html tree: one rooted at the BeautifulSoup object you used to parse the document, and one rooted at the tag that was extracted. Documentation.
After that, you can scrape everything you need as well as without removing it.
See this Organic Results extraction I did recently. It scrapes title, summary, and link from Google Search Results.
Alternatively, you can use Google Search Engine Results API from SerpApi. Check out the Playground.
Code and example in online IDE:
from serpapi import GoogleSearch
import os
params = {
"engine": "google",
"q": "fus ro dah",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
print(f"Title: {result['title']}\nSnippet: {result['snippet']}\nLink: {result['link']}\n")
Output:
Title: Skyrim - FUS RO DAH (Dovahkiin) HD - YouTube
Snippet: I looked around for a fan made track that included Fus Ro Dah, but the ones that I found were pretty bad - some ...
Link: https://www.youtube.com/watch?v=JblD-FN3tgs
Title: Unrelenting Force (Skyrim) | Elder Scrolls | Fandom
Snippet: If the general subtitles are turned on, it can be seen that the text for the Draugr's Unrelenting Force is misspelled: "Fus Rah Do" instead of the proper "Fus Ro Dah." ...
Link: https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
Title: Fus Ro Dah | Know Your Meme
Snippet: Origin. "Fus Ro Dah" are the words for the "unrelenting force" thu'um shout in the game Elder Scrolls V: Skyrim. After reaching the first town of ...
Link: https://knowyourmeme.com/memes/fus-ro-dah
Title: Fus ro dah - Urban Dictionary
Snippet: 1. A dragon shout used in The Elder Scrolls V: Skyrim. 2.An international term for oral sex given by a female. ex.1. The Dragonborn yelled "Fus ...
Link: https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
Part of JSON:
"organic_results": [
{
"position": 1,
"title": "Unrelenting Force (Skyrim) | Elder Scrolls | Fandom",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)",
"displayed_link": "https://elderscrolls.fandom.com › wiki › Unrelenting_F...",
"snippet": "If the general subtitles are turned on, it can be seen that the text for the Draugr's Unrelenting Force is misspelled: \"Fus Rah Do\" instead of the proper \"Fus Ro Dah.\" ...",
"sitelinks": {
"inline": [
{
"title": "Location",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Location"
},
{
"title": "Effect",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Effect"
},
{
"title": "Usage",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Usage"
},
{
"title": "Word Wall",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Word_Wall"
}
]
},
"cached_page_link": "https://webcache.googleusercontent.com/search?q=cache:K3LEBjvPps0J:https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)+&cd=17&hl=en&ct=clnk&gl=us"
}
]
Disclaimer, I work for SerpApi.
Having trouble scraping links and article names from google scholar. I'm unsure if the issue is with my code or the xpath that I'm using to retrieve the data – or possibly both?
I've already spent the past few hours trying to debug/consulting other stackoverflow queries but to no success.
import scrapy
from scrapyproj.items import ScrapyProjItem
class scholarScrape(scrapy.Spider):
name = "scholarScraper"
allowed_domains = "scholar.google.com"
start_urls=["https://scholar.google.com/scholar?hl=en&oe=ASCII&as_sdt=0%2C44&q=rare+disease+discovery&btnG="]
def parse(self,response):
item = ScrapyProjItem()
item['hyperlink'] = item.xpath("//h3[class=gs_rt]/a/#href").extract()
item['name'] = item.xpath("//div[#class='gs_rt']/h3").extract()
yield item
The error messages I have been receiving say: "AttributeError: xpath" so I believe that the issue lies with the path that I'm using to try and retrieve the data, but I could also be mistaken?
Adding my comment as an answer, as it solved the problem:
The issue is with scrapyproj.items.ScrapyProjItem objects: they do not have an xpath attribute. Is this an official scrapy class? I think you meant to call xpath on response:
item['hyperlink'] = response.xpath("//h3[class=gs_rt]/a/#href").extract()
item['name'] = response.xpath("//div[#class='gs_rt']/h3").extract()
Also, the first path expression might need a set of quotes around the attribute value "gs_rt":
item['hyperlink'] = response.xpath("//h3[class='gs_rt']/a/#href").extract()
Apart from that, the XPath expressions are fine.
Alternative solution using bs4:
from bs4 import BeautifulSoup
import requests, lxml, os
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://scholar.google.com/citations?hl=en&user=m8dFEawAAAAJ', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# Container where all articles located
for article_info in soup.select('#gsc_a_b .gsc_a_t'):
# title CSS selector
title = article_info.select_one('.gsc_a_at').text
# Same title CSS selector, except we're trying to get "data-href" attribute
# Note, it will be relative link, so you need to join it with absolute link after extracting.
title_link = article_info.select_one('.gsc_a_at')['data-href']
print(f'Title: {title}\nTitle link: https://scholar.google.com{title_link}\n')
# Part of the output:
'''
Title: Automating Gödel's Ontological Proof of God's Existence with Higher-order Automated Theorem Provers.
Title link: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=m8dFEawAAAAJ&citation_for_view=m8dFEawAAAAJ:-f6ydRqryjwC
'''
Alternatively, you can do the same with Google Scholar Author Articles API from SerpApi.
The main difference is that you don't have to think about finding good proxies, trying to solve CAPTCHA even if you're using selenium. It's a paid API with a free plan.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
"api_key": os.getenv("API_KEY"),
"engine": "google_scholar_author",
"author_id": "9PepYk8AAAAJ",
}
search = GoogleSearch(params)
results = search.get_dict()
for article in results['articles']:
article_title = article['title']
article_link = article['link']
# Part of the output:
'''
Title: p-GaN gate HEMTs with tungsten gate metal for high threshold voltage and low gate current
Link: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9PepYk8AAAAJ&citation_for_view=9PepYk8AAAAJ:bUkhZ_yRbTwC
'''
Disclaimer, I work for SerpApi.
Goal: I would like to verify, if a specific Google search has a suggested result on the right hand side and - in case of such a suggestion - scrape some information like company type / address / etc.
Approach: I wanted to use a Python scraper with Requests and BeautifulSoup4
import bs4
import requests
address='https://www.google.co.ve/?gws_rd=cr&ei=DgBqVpWJMoPA-gHy25fACg#q=caracas+arepa'
page = requests.get(address)
soup = bs4.BeautifulSoup(page.content,'html.parser')
print (soup.prettify())
Issue:
The requested page does not include the search results (I am not sure if some variable on the Google page is set to invisible?), Rather only the header and footer of the Google page
Questions:
Alternative ways to obtain the described information? Any ideas?
Once I obtained results with the described method, but the respective address was constructed differently (I remember many numbers in the Google URL, but unfortunately cannot reproduce the search address). Therefore: Is there a requirement of the Google URL so that it can be scraped via requests.get?
The best way to get information from a service like Google Places will almost always be the official API. That said, if you're dead set on scraping, it's likely that what's returned by the HTTP request is meant for a browser to render. What BeautifulSoup does is not equivalent to rendering the data it receives, so it's very likely you're just getting useless empty containers that are later filled out dynamically.
I think your question is similar to google-search-with-python-reqeusts, maybe you could get some help from that~
And I agree with LiterallyElvis, API is better idea than crawl it directly.
Finally if you want to use requests for this work, I recommend to use PhantomJS and selenium to mock browser works, as Google should use some AJAX tech which makes different views between real browser and crawler.
As in country of difficult to visit Google, I couldn't repeat your problem directly, the above are sth I could think about, wish it helps
You need select_one() element (container) that contains all the needed data and check if it exists, and if so, scrape the data.
Make sure you're using user-agent to act as a "real" user visit, otherwise your request might be blocked or you receive a different HTML with different selectors. Check what's your user-agent.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
params = {
"q": "caracas arepa bar google",
"gl": "us"
}
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
}
html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
# if right side knowledge graph is present -> parse the data.
if soup.select_one(".liYKde"):
place_name = soup.select_one(".PZPZlf.q8U8x span").text
place_type = soup.select_one(".YhemCb+ .YhemCb").text
place_reviews = soup.select_one(".hqzQac span").text
place_rating = soup.select_one(".Aq14fc").text
print(place_name, place_type, place_reviews, place_rating, sep="\n")
# output:
'''
Caracas Arepa Bar
Venezuelan restaurant
1,123 Google reviews
4.5
'''
Alternatively, you can achieve the same thing using Google Knowledge Graph API from SerpApi. It's a paid API with a free plan.
The biggest difference is that you don't need to figure out how to parse the data, increase the number of requests, bypass blocks from Google, and other search engines.
from serpapi import GoogleSearch
params = {
"api_key": "YOUR_API_KEY",
"engine": "google",
"q": "caracas arepa bar place",
"hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
print(json.dumps([results["knowledge_graph"]], indent=2))
# part of the output:
'''
[
{
"title": "Caracas Arepa Bar",
"type": "Venezuelan restaurant",
"place_id": "ChIJVcQ2ll9ZwokRwmkvsArPXyo",
"website": "http://caracasarepabar.com/",
"description": "Arepa specialist offering creative, low-priced renditions of the Venezuelan corn-flour staple.",
"local_map": {
"image": "https://www.google.com/maps/vt/data=TF2Rd51PtEnU2M3pkZHYHKdSwhMDJ_ZwRfg0vfwlDRAmv1u919sgFl8hs_lo832ziTWxCZM9BKECs6Af-TA1hh0NLjuYAzOLFA1-RBEmj-8poygymcRX2KLNVTGGZZKDerZrKW6fnkONAM4Ui-BVN8XwFrwigoqqxObPg8bqFIgeM3LPCg",
"link": "https://www.google.com/maps/place/Caracas+Arepa+Bar/#40.7131972,-73.9574167,15z/data=!4m2!3m1!1s0x0:0x2a5fcf0ab02f69c2?sa=X&hl=en",
"gps_coordinates": {
"latitude": 40.7131972,
"longitude": -73.9574167,
"altitude": 15
}
} ... much more results including place images, popular times, user reviews.
}
]
'''
Disclaimer: I work for SerpApi.
I asked a question on realizing a general idea to crawl and save webpages.
Part of the original question is: how to crawl and save a lot of "About" pages from the Internet.
With some further research, I got some choices to go ahead with both on scraping and parsing (listed at the bottom).
Today, I ran into another Ruby discussion about how to scrape from Google search results. This provides a great alternative for my problem which will save all the effort on the crawling part.
The new question are: in Python, to scrape Google search results for a given keyword, in this case "About", and finally get the links for further parsing.
What are the best choices of methods and libraries to go ahead with? (in measure of easy-to-learn and easy-to-implement).
p.s. in this website, the exactly same thing is implemented, but closed and ask for money for more results. I'd prefer to do it myself if no open-source available and learn more Python in the meanwhile.
Oh, btw, advices for parsing the links from search results would be nice, if any. Still, easy-to-learn and easy-to-implement. Just started learning Python. :P
Final update, problem solved. Code using xgoogle, please read note in the section below in order to make xgoogle working.
import time, random
from xgoogle.search import GoogleSearch, SearchError
f = open('a.txt','wb')
for i in range(0,2):
wt = random.uniform(2, 5)
gs = GoogleSearch("about")
gs.results_per_page = 10
gs.page = i
results = gs.get_results()
#Try not to annnoy Google, with a random short wait
time.sleep(wt)
print 'This is the %dth iteration and waited %f seconds' % (i, wt)
for res in results:
f.write(res.url.encode("utf8"))
f.write("\n")
print "Done"
f.close()
Note on xgoogle (below answered by Mike Pennington):
The latest version from it's Github does not work by default already, due to changes in Google search results probably. These two replies (a b) on the home page of the tool give a solution and it is currently still working with this tweak. But maybe some other day it may stop working again due to Google's change/block.
Resources known so far:
For scraping, Scrapy seems to be a popular choice and a webapp called ScraperWiki is very interesting and there is another project extract it's library for offline/local usage. Mechanize was brought up quite several times in different discussions too.
For parsing HTML, BeautifulSoup seems to be the one of the most
popular choices. Of course. lxml too.
You may find xgoogle useful... much of what you seem to be asking for is there...
There is a twill lib for emulating browser. I used it when had a necessity to login with google email account. While it's a great tool with a great idea, it's pretty old and seems to have a lack of support nowadays (the latest version is released in 2007).
It might be useful if you want to retrieve results that require cookie-handling or authentication. Likely that twill is one of the best choices for that purposes.
BTW, it's based on mechanize.
As for parsing, you are right, BeautifulSoup and Scrapy are great. One of the cool things behind BeautifulSoup is that it can handle invalid HTML (unlike Genshi, for example.)
Have a look at this awesome urllib wrapper for web scraping https://github.com/mattseh/python-web/blob/master/web.py
Another option to scrape Google search results using Python is the one by ZenSERP.
I like the API-first approach which is easy to use and the JSON results are easily integrated into our solution.
Here is an example for a curl request:
curl "https://app.zenserp.com/api/search" -F "q=Pied Piper" -F "location=United States" -F "search_engine=google.com" -F "language=English" -H "apikey: APIKEY"
And the response:
{
"q": "Pied Piper",
"domain": "google.com",
"location": "United States",
"language": "English",
"url": "https://www.google.com/search?q=Pied%20Piper&num=100&hl=en&gl=US&gws_rd=cr&ie=UTF-8&oe=UTF-8&uule=w+CAIQIFISCQs2MuSEtepUEUK33kOSuTsc",
"total_results": 17100000,
"auto_correct": "",
"auto_correct_type": "",
"results": []
}
A Python code for example:
import requests
headers = {
'apikey': 'APIKEY',
}
params = (
('q', 'Pied Piper'),
('location', 'United States'),
('search_engine', 'google.com'),
('language', 'English'),
)
response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)
To extract links from multiple pages of Google Search results you can use SerpApi. It's a paid API with a free trial.
Full example
import os
# Python package: https://pypi.org/project/google-search-results
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "about",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
pages = search.pagination()
for result in pages:
print(f"Current page: {result['serpapi_pagination']['current']}\n")
for organic_result in result["organic_results"]:
print(
f"Title: {organic_result['title']}\nLink: {organic_result['link']}\n"
)
Output
Current page: 12
URL: https://fi.google.com/
URL: https://www.mayoclinic.org/about-mayo-clinic
...
Current page: 18
URL: https://igem.org/About
URL: https://www.ieee.org/
URL: https://www.cancer.org/
...
Disclaimer: I work at SerpApi.
This one works good for this moment. If any search is made, the scraper keeps grabbing titles and their links traversing all next pages until there is no more next page is left or your ip address is banned. Make sure your bs4 version is >= 4.7.0 as I've used pseudo css selector within the script.
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
base = "https://www.google.de"
link = "https://www.google.de/search?q={}"
headers = {
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}
def grab_content(link):
res = requests.get(link,headers=headers)
soup = BeautifulSoup(res.text,"lxml")
for container in soup.select("[class='g'] a[href^='http'][data-ved]:has(h3)"):
post_title = container.select_one("h3").get_text(strip=True)
post_link = container.get('href')
yield post_title,post_link
next_page = soup.select_one("a[href][id='pnnext']")
if next_page:
next_page_link = urljoin(base,next_page.get("href"))
yield from grab_content(next_page_link)
if __name__ == '__main__':
search_keyword = "python"
qualified_link = link.format(search_keyword.replace(" ","+"))
for item in grab_content(qualified_link):
print(item)
This can be done using google and beautifulsoup module, install it in CMD using command given below:
pip install google beautifulsoup4
Thereafter, run this simplified code given below
import webbrowser, googlesearch as gs
def direct(txt):
print(f"sure, searching '{txt}'...")
results=gs.search(txt,num=1,stop=1,pause=0)
#num, stop denotes number of search results you want
for link in results:
print(link)
webbrowser.open_new_tab(link)#to open the results in browser
direct('cheap thrills on Youtube') #this will play the song on YouTube
#(for this, keep num=1,stop=1)
Output:
TIP: Using this, you can also make a small Virtual Assistant that will open the top search result in browser for your given query(txt) in natural language.
Feel free to comment in case of difficulty while running this code:)
from urllib.request import urlopen
from bs4 import BeautifulSoup
import urllib.request
import re
import numpy as np
count=0
query=input("query>>")
query=query.strip().split()
query="+".join(query)
html = "https://www.google.co.in/search?site=&source=hp&q="+query+"&gws_rd=ssl"
req = urllib.request.Request(html, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urlopen(req).read(),"html.parser")
#Regex
reg=re.compile(".*&sa=")
links = []
#Parsing web urls
for item in soup.find_all('h3', attrs={'class' : 'r'}):
line = (reg.match(item.a['href'][7:]).group())
links.append(line[:-4])
print(links)
this should be handy....for more go to -
https://github.com/goyal15rajat/Crawl-google-search.git
Here is a Python script using requests and BeautifulSoup to scrape Google results.
import urllib
import requests
from bs4 import BeautifulSoup
# desktop user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
# mobile user-agent
MOBILE_USER_AGENT = "Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36"
query = "hackernoon How To Scrape Google With Python"
query = query.replace(' ', '+')
URL = f"https://google.com/search?q={query}"
headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)
if resp.status_code == 200:
soup = BeautifulSoup(resp.content, "html.parser")
results = []
for g in soup.find_all('div', class_='r'):
anchors = g.find_all('a')
if anchors:
link = anchors[0]['href']
title = g.find('h3').text
item = {
"title": title,
"link": link
}
results.append(item)
print(results)