I'm trying to scrape Google Finance, and get the "Related Stocks" table, which has id "cc-table" and class "gf-table" based on the webpage inspector in Chrome. (Sample Link: https://www.google.com/finance?q=tsla)
But when I run .find("table") or .findAll("table"), this table does not come up. I can find JSON-looking objects with the table's contents in the HTML content in Python, but do not know how to get it. Any ideas?
The page is rendered with JavaScript. There are several ways to render and scrape it.
I can scrape it with Selenium.
First install Selenium:
sudo pip3 install selenium
Then download a ChromeDriver from https://sites.google.com/a/chromium.org/chromedriver/downloads and make sure it is on your PATH.
import bs4 as bs
from selenium import webdriver
browser = webdriver.Chrome()
url = ("https://www.google.com/finance?q=tsla")
browser.get(url)
html_source = browser.page_source
browser.quit()
soup = bs.BeautifulSoup(html_source, "lxml")
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())
Alternatively, with PyQt5:
from PyQt5.QtGui import *
from PyQt5.QtCore import *
from PyQt5.QtWebKit import *
from PyQt5.QtWebKitWidgets import QWebPage
from PyQt5.QtWidgets import QApplication
import bs4 as bs
import sys
class Render(QWebPage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        QWebPage.__init__(self)
        self.loadFinished.connect(self._loadFinished)
        self.mainFrame().load(QUrl(url))
        self.app.exec_()

    def _loadFinished(self, result):
        self.frame = self.mainFrame()
        self.app.quit()
url = "https://www.google.com/finance?q=tsla"
r = Render(url)
result = r.frame.toHtml()
soup = bs.BeautifulSoup(result,'lxml')
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())
Alternatively, with Dryscrape:
import bs4 as bs
import dryscrape
url = "https://www.google.com/finance?q=tsla"
session = dryscrape.Session()
session.visit(url)
dsire_get = session.body()
soup = bs.BeautifulSoup(dsire_get,'lxml')
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())
All three approaches produce the same output:
Valuation▲▼Company name▲▼Price▲▼Change▲▼Chg %▲▼d | m | y▲▼Mkt Cap▲▼TSLATesla Inc328.40-1.52-0.46%53.69BDDAIFDaimler AG72.94-1.50-2.01%76.29BFFord Motor Company11.53-0.17-1.45%45.25BGMGeneral Motors Co...36.07-0.34-0.93%53.93BRNSDFRENAULT SA EUR3.8197.000.000.00%28.69BHMCHonda Motor Co Lt...27.52-0.18-0.65%49.47BAUDVFAUDI AG NPV840.400.000.00%36.14BTMToyota Motor Corp...109.31-0.53-0.48%177.79BBAMXFBAYER MOTOREN WER...94.57-2.41-2.48%56.93BNSANYNissan Motor Co L...20.400.000.00%42.85BMMTOFMITSUBISHI MOTOR ...6.86+0.091.26%10.22B
EDIT
QtWebKit got deprecated upstream in Qt 5.5 and removed in 5.6.
You can switch to PyQt5.QtWebEngineWidgets
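For reference, here is a rough sketch of the same Render class ported to QtWebEngine. Note that QWebEnginePage.toHtml() is asynchronous and delivers the HTML to a callback, so the flow differs slightly from the QtWebKit version:
import sys
import bs4 as bs
from PyQt5.QtCore import QUrl
from PyQt5.QtWidgets import QApplication
from PyQt5.QtWebEngineWidgets import QWebEnginePage

class Render(QWebEnginePage):
    def __init__(self, url):
        self.app = QApplication(sys.argv)
        super().__init__()
        self.html = None
        self.loadFinished.connect(self._load_finished)
        self.load(QUrl(url))
        self.app.exec_()

    def _load_finished(self, ok):
        # toHtml is asynchronous: the rendered HTML is handed to a callback
        self.toHtml(self._store_html)

    def _store_html(self, html):
        self.html = html
        self.app.quit()

r = Render("https://www.google.com/finance?q=tsla")
soup = bs.BeautifulSoup(r.html, "lxml")
for el in soup.find_all("table", {"id": "cc-table"}):
    print(el.get_text())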
You can scrape Google Finance with the BeautifulSoup web scraping library without Selenium, since the data you want to extract isn't rendered via JavaScript. It will also be much faster than launching a whole browser.
Check code in online IDE.
from bs4 import BeautifulSoup
import requests, lxml, json
params = {
    "hl": "en"
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36",
}
html = requests.get("https://www.google.com/finance?q=tsla", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
ticker_data = []
for ticker in soup.select('.tOzDHb'):
    title = ticker.select_one('.RwFyvf').text
    price = ticker.select_one('.YMlKec').text
    index = ticker.select_one('.COaKTb').text
    price_change = ticker.select_one("[jsname=Fe7oBc]")["aria-label"]

    ticker_data.append({
        "index": index,
        "title": title,
        "price": price,
        "price_change": price_change
    })
print(json.dumps(ticker_data, indent=2))
Example output
[
{
"index": "Index",
"title": "Dow Jones Industrial Average",
"price": "32,774.41",
"price_change": "Down by 0.18%"
},
{
"index": "Index",
"title": "S&P 500",
"price": "4,122.47",
"price_change": "Down by 0.42%"
},
{
"index": "TSLA",
"title": "Tesla Inc",
"price": "$850.00",
"price_change": "Down by 2.44%"
},
# ...
]
There's a scrape Google Finance Ticker Quote Data in Python blog post if you need to scrape more data from Google Finance.
Most website owners don't like scrapers because they take data the company values, use up a whole bunch of their server time and bandwidth, and give nothing in return. Big companies like Google may have entire teams employing a whole host of methods to detect and block bots trying to scrape their data.
There are several ways around this:
Scrape from another less secured website.
See if Google or another company has an API for public use (for example, the Custom Search JSON API; a rough sketch follows this list).
Use a more advanced scraper like Selenium (and probably still be blocked by Google).
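A minimal sketch of the second option, using Google's Custom Search JSON API (the API key and search engine id below are placeholders you would create yourself; quotas and pricing apply):
import requests

params = {
    "key": "YOUR_API_KEY",          # placeholder: your Google API key
    "cx": "YOUR_SEARCH_ENGINE_ID",  # placeholder: your Programmable Search Engine id
    "q": "tesla related stocks",
}
resp = requests.get("https://www.googleapis.com/customsearch/v1", params=params)
for item in resp.json().get("items", []):
    print(item["title"], item["link"])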
Related
I'm a newbie with python.
In PyCharm I wrote this code:
import requests
from bs4 import BeautifulSoup
response = requests.get(f"https://www.google.com/search?q=fitness+wear")
soup = BeautifulSoup(response.text, 'html.parser')
print(soup)
Instead of getting the HTML of the search results, what I get is the HTML of the following page
I use the same code within a script on pythonanywhere.com and it works perfectly. I've tried lots of the solutions I found but the result is always the same, so now I'm stuck with it.
I think this should work:
import requests
from bs4 import BeautifulSoup
with requests.Session() as s:
    url = "https://www.google.com/search?q=fitness+wear"
    headers = {
        "referer": "https://www.google.com/",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
    }
    s.post(url, headers=headers)
    response = s.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    print(soup)
It uses a request session and a POST request to create any initial cookies (not fully sure on this) and then allows you to scrape.
If you open a private window in your browser and go to google.com, you should see the same pop-up prompting you to give your consent. This is because you aren't sending any session cookies.
You have different options to tackle this.
One would be to send the cookies you can observe on the website along with the request directly, like so:
import requests
cookies = {"CONSENT":"YES+shp.gws-20210330-0-RC1.de+FX+412", ...}
resp = requests.get("https://www.google.com/search?q=fitness+wear", cookies=cookies)
The solution @Dimitriy Kruglikov uses is a lot cleaner, though, and using sessions is a good way of having a persistent session with the website.
Google isn't blocking you; you can still extract data from the HTML.
Using cookies isn't very convenient, and using a session with POST and GET requests generates more traffic.
You can remove this popup using either the decompose() or extract() BS4 methods:
annoying_popup.decompose() will completely destroy it and its contents. Documentation.
annoying_popup.extract() will remove the tag from the tree, leaving two HTML trees: one rooted at the BeautifulSoup object you used to parse the document, and one rooted at the tag that was extracted. Documentation.
After that, you can scrape everything you need, just as you would without the popup.
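A tiny sketch of the idea; the popup markup and selector here are made up for illustration, so inspect the real page to find the actual element:
from bs4 import BeautifulSoup

html = "<div id='consent-popup'>Before you continue...</div><div class='g'>a result</div>"
soup = BeautifulSoup(html, "html.parser")

annoying_popup = soup.select_one("#consent-popup")  # hypothetical selector
if annoying_popup:
    annoying_popup.decompose()  # or annoying_popup.extract() to keep the removed tag around

print(soup)  # the popup is gone; scrape the rest as usual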
See this Organic Results extraction I did recently. It scrapes title, summary, and link from Google Search Results.
Alternatively, you can use Google Search Engine Results API from SerpApi. Check out the Playground.
Code and example in online IDE:
from serpapi import GoogleSearch
import os
params = {
    "engine": "google",
    "q": "fus ro dah",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    print(f"Title: {result['title']}\nSnippet: {result['snippet']}\nLink: {result['link']}\n")
Output:
Title: Skyrim - FUS RO DAH (Dovahkiin) HD - YouTube
Snippet: I looked around for a fan made track that included Fus Ro Dah, but the ones that I found were pretty bad - some ...
Link: https://www.youtube.com/watch?v=JblD-FN3tgs
Title: Unrelenting Force (Skyrim) | Elder Scrolls | Fandom
Snippet: If the general subtitles are turned on, it can be seen that the text for the Draugr's Unrelenting Force is misspelled: "Fus Rah Do" instead of the proper "Fus Ro Dah." ...
Link: https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
Title: Fus Ro Dah | Know Your Meme
Snippet: Origin. "Fus Ro Dah" are the words for the "unrelenting force" thu'um shout in the game Elder Scrolls V: Skyrim. After reaching the first town of ...
Link: https://knowyourmeme.com/memes/fus-ro-dah
Title: Fus ro dah - Urban Dictionary
Snippet: 1. A dragon shout used in The Elder Scrolls V: Skyrim. 2.An international term for oral sex given by a female. ex.1. The Dragonborn yelled "Fus ...
Link: https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
Part of JSON:
"organic_results": [
{
"position": 1,
"title": "Unrelenting Force (Skyrim) | Elder Scrolls | Fandom",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)",
"displayed_link": "https://elderscrolls.fandom.com › wiki › Unrelenting_F...",
"snippet": "If the general subtitles are turned on, it can be seen that the text for the Draugr's Unrelenting Force is misspelled: \"Fus Rah Do\" instead of the proper \"Fus Ro Dah.\" ...",
"sitelinks": {
"inline": [
{
"title": "Location",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Location"
},
{
"title": "Effect",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Effect"
},
{
"title": "Usage",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Usage"
},
{
"title": "Word Wall",
"link": "https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)#Word_Wall"
}
]
},
"cached_page_link": "https://webcache.googleusercontent.com/search?q=cache:K3LEBjvPps0J:https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)+&cd=17&hl=en&ct=clnk&gl=us"
}
]
Disclaimer, I work for SerpApi.
I am trying to scrape the Google knowledge panel to retrieve the names of drugs if they do not appear in the Google search results. For instance, if I look for "Buscopan" in Google, the resulting page looks like this:
What I am trying to do with the code shown is grab the term "Scopolamina-N-butilbromuro" from the knowledge panel, but I am unable to retrieve it from the HTML once I inspect the element. To be precise, the code I am implementing, together with the error message, is as follows:
import requests
from bs4 import BeautifulSoup
# URL
url = "https://www.google.com/search?client=safari&rls=en&q="+"buscopan"+"&ie=UTF-8&oe=UTF-8"
# Sending HTTP request
req = requests.get(url)
# Pulling HTTP data from internet
sor = BeautifulSoup(req.text, "html.parser")
temp = sor.find("h2", class_= "qrShPb kno-ecr-pt PZPZlf mfMhoc hNKfZe").text
print(temp)
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-39-ef5599a1a1fc> in <module>
13 # Finding temperature in Celsius
14 #temp = sor.find("h2", class_='qrShPb').text
---> 15 temp = sor.find("h2", class_= "qrShPb kno-ecr-pt PZPZlf mfMhoc hNKfZe").text
16
17
AttributeError: 'NoneType' object has no attribute 'text'
I don't know what I am doing wrong. I think the bit of html code I need to look at is the following:
<h2 class="qrShPb kno-ecr-pt PZPZlf mfMhoc hNKfZe" data-local-attribute="d3bn" data-attrid="title" data-ved="2ahUKEwjujfLcgO7rAhWKjosKHSiBAFEQ3B0oATASegQIEBAL"></h2>
Of course, the rest of the HTML code is in the picture above, but if you need a bigger version, please do not hesitate to ask!
Any suggestion?
Thank you,
Federico
To get the correct result page from Google Search, specify the User-Agent HTTP header. For example:
import requests
from bs4 import BeautifulSoup
params = {
    'q': 'buscopan',  # <-- change to your keyword
    'hl': 'it'        # <-- change to `en` for English results
}
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:80.0) Gecko/20100101 Firefox/80.0'}
url = 'https://www.google.com/search'
soup = BeautifulSoup(requests.get(url, params=params, headers=headers).content, 'html.parser')
print(soup.select_one('h2[data-attrid="title"]').text)
Prints:
Scopolamina-N-butilbromuro
Alternatively to Andrej Kesely's solution, you can use the third-party Google Knowledge Graph API from SerpApi. It's a paid API with a free plan. Check out the Playground to test.
Code to integrate and full example in the online IDE:
from serpapi import GoogleSearch
import os
params = {
    "q": "Buscopan",
    "google_domain": "google.com",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
title = results['knowledge_graph']['title']
print(title)
Output:
Butylscopolamine
Part of JSON Knowledge Graph output:
"knowledge_graph": {
"title": "Butylscopolamine",
"type": "Medication",
"description": "Hyoscine butylbromide, also known as scopolamine butylbromide and sold under the brandname Buscopan among others, is an anticholinergic medication used to treat crampy abdominal pain, esophageal spasms, renal colic, and bladder spasms. It is also used to improve respiratory secretions at the end of life.",
"source": {
"name": "Wikipedia",
"link": "https://en.wikipedia.org/wiki/Hyoscine_butylbromide"
},
"formula": "C₂₁H₃₀BrNO₄",
"molar_mass": "440.371 g/mol",
"chem_spider_id": "16736107",
"trade_name": "Buscopan, others",
"pub_chem_cid": "6852391",
"ch_ebi_id": "32123",
"people_also_search_for": "Scopolamine, Metamizole, MORE"
}
Disclaimer, I work for SerpApi.
I need to grab text data from the Google search engine info bar. If someone uses the keyword "siemens" to search on Google, a small info bar appears on the right side of the search results. I want to collect some text information from that info bar. How can I do that using requests and BeautifulSoup? Here is some of the code I have written:
from bs4 import BeautifulSoup as BS
import requests
from googlesearch import search
from googleapiclient.discovery import build
url = 'https://www.google.com/search?ei=j-iKXNDxDMPdwALdwofACg&q='
com = 'siemens'
#for url in search(com, tld='de', lang='de', stop=10):
# print(url)
response = requests.get(url+com)
soup = BS(response.content, 'html.parser')
The red-marked area in the screenshot is the info bar.
You can use the find function in BeautifulSoup to retrieve all the elements with a given class name, id, CSS selector, and so on. If you inspect the info bar (right-click on it and choose 'Inspect') you can find the unique class name or id for that bar. Use that to filter out the info bar alone from the entire HTML parsed by BeautifulSoup.
Check out find() and find_all() in BeautifulSoup to achieve your output. Always go for finding by id first, since every id is unique to an HTML element. If there is no id for it, then go for the other options.
To obtain the URL, use google.com/search?q=[] with your search query inside []. For queries with more than one word, use a '+' in between, as in the sketch below.
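A rough sketch of that approach (the id and class names here are made up; use whatever you find in the inspector):
import requests
from bs4 import BeautifulSoup

query = "siemens"
url = "https://www.google.com/search?q=" + query.replace(" ", "+")

soup = BeautifulSoup(requests.get(url).content, "html.parser")

# Prefer an id if the element has one, since ids are unique...
info_bar = soup.find(id="info-bar-id")  # hypothetical id
# ...otherwise fall back to a class name
if info_bar is None:
    info_bar = soup.find("div", class_="info-bar-class")  # hypothetical class

if info_bar:
    print(info_bar.get_text(" ", strip=True))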
Make sure you're sending a user-agent header to look like a real user visit, otherwise Google might block the request. List of user-agents.
To visually select elements from a page, you can use the SelectorGadget Chrome extension to grab CSS selectors.
Code and example in online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get('https://www.google.com/search?q=simens', headers=headers).text
soup = BeautifulSoup(response, 'lxml')
title = soup.select_one('.SPZz6b h2').text
subtitle = soup.select_one('.wwUB2c span').text
website = soup.select_one('.ellip .ellip').text
snippet = soup.select_one('.Uo8X3b+ span').text
print(f'{title}\n{subtitle}\n{website}\n{snippet}')
Output:
Siemens
Automation company
siemens.com
Siemens AG is a German multinational conglomerate company headquartered in Munich and the largest industrial manufacturing company in Europe with branch offices abroad.
Alternatively, you can use Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "simens",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
title = results["knowledge_graph"]["title"]
subtitle = results["knowledge_graph"]["type"]
website = results["knowledge_graph"]["website"]
snippet = results["knowledge_graph"]["description"]
print(f'{title}\n{subtitle}\n{website}\n{snippet}')
Output:
Siemens
Automation company
http://www.siemens.com/
Siemens AG is a German multinational conglomerate company headquartered in Munich and the largest industrial manufacturing company in Europe with branch offices abroad.
Disclaimer, I work at SerpApi.
Working on a little web scraping program to get some data and help me make some bets.
Ultimately, I want to parse the "Trends" section under each game of the current week on pages like this (https://www.oddsshark.com/nfl/arizona-kansas-city-odds-november-11-2018-971332)
My current algorithm:
GET https://www.oddsshark.com/nfl/scores
Parse the webpage for the little "vs" button which holds links to all the games
Parse for the Trends
Here's how I started:
from bs4 import BeautifulSoup
import requests
url = "https://www.oddsshark.com/nfl/scores"
result = requests.get("https://www.oddsshark.com/nfl/scores")
print ("Status: ", result.status_code)
content = result.content
soup = BeautifulSoup(content, 'html.parser')
print (soup)
When I look at the output, I don't really see any of those links. Is it because a lot of the site uses JavaScript?
Any pointers on the code/algorithm appreciated!
You can use the internal API this site uses to get all the links and iterate over them to get the trends info, which is embedded in a script tag with id gc-data:
import requests
import json
from bs4 import BeautifulSoup
r = requests.get(
    'https://io.oddsshark.com/ticker/nfl',
    headers={
        'referer': 'https://www.oddsshark.com/nfl/scores'
    }
)

links = [
    (
        t["event_date"],
        t["away_name"],
        t["home_name"],
        "https://www.oddsshark.com{}".format(t["matchup_link"])
    )
    for t in r.json()['matchups']
    if t["type"] == "matchup"
]

for t in links:
    print("{} - {} vs {} => {}".format(t[0], t[1], t[2], t[3]))
    r = requests.get(t[3])
    soup = BeautifulSoup(r.content, "lxml")
    trends = [
        json.loads(v.text)
        for v in soup.findAll('script', {"type": "application/json", "id": "gc-data"})
    ]
    print(trends[0]["oddsshark_gamecenter"]["trends"])
    print("#########################################")
The reason you don't see those links is that they're not in the response that requests receives. This is very likely for one of two reasons:
The server recognizes that you are trying to scrape the site with a script, and sends you different content. Usually this is because of the User-Agent set by requests.
The content is added dynamically via JavaScript that runs in the browser.
You could probably render this content using a headless browser in your Python script and end up with the same content you see when you visit the site with Chrome, etc. Per (1), it might also be necessary to experiment with the User-Agent header in your request.
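For example, a headless-browser version might look roughly like this (a sketch using Selenium with headless Chrome; the driver setup and User-Agent string are assumptions you would adapt):
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless")
# Per (1), spoof a regular browser User-Agent as well
options.add_argument("user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0 Safari/537.36")

browser = webdriver.Chrome(options=options)
browser.get("https://www.oddsshark.com/nfl/scores")
html = browser.page_source  # now contains the JavaScript-rendered content
browser.quit()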
The data is loaded into the trends table via JavaScript, but it is actually included in a script tag inside the HTML that you receive. You can parse it like this:
import requests
import json
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:59.0) Gecko/20100101 Firefox/59.0'
}
response = requests.get('https://www.oddsshark.com/nfl/arizona-kansas-city-odds-november-11-2018-971332', headers=headers)
soup = BeautifulSoup(response.text, "lxml")
data = json.loads(soup.find("script", {'id': 'gc-data'}).text)
print(data['oddsshark_gamecenter']['trends'])
Outputs:
{'local': {'title': 'Trends'}, 'away': [{'value': 'Arizona is 4-1-1 ATS in its last 6 games '}, {'value': 'Arizona is 2-6 SU in its last 8 games '}, {'value': "The total has gone UNDER in 8 of Arizona's last 12 games "}, {'value': 'Arizona is 3-7-1 ATS in its last 11 games on the road'}, {'value': 'Arizona is 2-4 SU in its last 6 games on the road'}...
I asked a question on realizing a general idea to crawl and save webpages.
Part of the original question is: how to crawl and save a lot of "About" pages from the Internet.
With some further research, I got some choices to go ahead with both on scraping and parsing (listed at the bottom).
Today, I ran into another Ruby discussion about how to scrape from Google search results. This provides a great alternative for my problem which will save all the effort on the crawling part.
The new question is: in Python, how to scrape Google search results for a given keyword, in this case "About", and finally get the links for further parsing.
What are the best choices of methods and libraries to go ahead with? (in measure of easy-to-learn and easy-to-implement).
P.S. On this website, exactly the same thing is implemented, but it's closed source and asks for money for more results. I'd prefer to do it myself if there's nothing open-source available, and learn more Python in the meanwhile.
Oh, btw, advice for parsing the links from the search results would be nice, if any. Again, easy-to-learn and easy-to-implement. I just started learning Python. :P
Final update, problem solved. The code below uses xgoogle; please read the note in the section further below in order to make xgoogle work.
import time, random
from xgoogle.search import GoogleSearch, SearchError
# Note: this is Python 2 code (xgoogle dates from the Python 2 era).
f = open('a.txt', 'wb')
for i in range(0, 2):
    wt = random.uniform(2, 5)
    gs = GoogleSearch("about")
    gs.results_per_page = 10
    gs.page = i
    results = gs.get_results()
    # Try not to annoy Google, with a random short wait
    time.sleep(wt)
    print 'This is the %dth iteration and waited %f seconds' % (i, wt)
    for res in results:
        f.write(res.url.encode("utf8"))
        f.write("\n")

print "Done"
f.close()
Note on xgoogle (below answered by Mike Pennington):
The latest version from its GitHub repo no longer works by default, probably due to changes in Google's search results. These two replies (a, b) on the tool's home page give a solution, and it currently still works with this tweak. But it may stop working again some day due to Google changing or blocking things.
Resources known so far:
For scraping, Scrapy seems to be a popular choice, a webapp called ScraperWiki is very interesting, and there is another project that extracts its library for offline/local usage. Mechanize was brought up several times in different discussions too.
For parsing HTML, BeautifulSoup seems to be one of the most popular choices. Of course, lxml too.
You may find xgoogle useful... much of what you seem to be asking for is there...
There is the twill lib for emulating a browser. I used it when I needed to log in with a Google email account. While it's a great tool with a great idea, it's pretty old and seems to lack support nowadays (the latest version was released in 2007).
It might be useful if you want to retrieve results that require cookie handling or authentication. Likely twill is one of the best choices for those purposes.
BTW, it's based on mechanize.
As for parsing, you are right, BeautifulSoup and Scrapy are great. One of the cool things about BeautifulSoup is that it can handle invalid HTML (unlike Genshi, for example).
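For instance, a quick illustration of BeautifulSoup coping with missing closing tags:
from bs4 import BeautifulSoup

broken = "<html><body><p>First paragraph<p>Second paragraph"
soup = BeautifulSoup(broken, "html.parser")
print(len(soup.find_all("p")))  # 2 -- both <p> tags are found despite the missing closing tags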
Have a look at this awesome urllib wrapper for web scraping https://github.com/mattseh/python-web/blob/master/web.py
Another option to scrape Google search results using Python is the one by ZenSERP.
I like the API-first approach which is easy to use and the JSON results are easily integrated into our solution.
Here is an example for a curl request:
curl "https://app.zenserp.com/api/search" -F "q=Pied Piper" -F "location=United States" -F "search_engine=google.com" -F "language=English" -H "apikey: APIKEY"
And the response:
{
"q": "Pied Piper",
"domain": "google.com",
"location": "United States",
"language": "English",
"url": "https://www.google.com/search?q=Pied%20Piper&num=100&hl=en&gl=US&gws_rd=cr&ie=UTF-8&oe=UTF-8&uule=w+CAIQIFISCQs2MuSEtepUEUK33kOSuTsc",
"total_results": 17100000,
"auto_correct": "",
"auto_correct_type": "",
"results": []
}
A Python code for example:
import requests
headers = {
    'apikey': 'APIKEY',
}

params = (
    ('q', 'Pied Piper'),
    ('location', 'United States'),
    ('search_engine', 'google.com'),
    ('language', 'English'),
)
response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)
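To see the results, you could then print the response body (assuming the API returns JSON, as in the curl example above):
print(response.json())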
To extract links from multiple pages of Google Search results you can use SerpApi. It's a paid API with a free trial.
Full example
import os
# Python package: https://pypi.org/project/google-search-results
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "about",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
pages = search.pagination()

for result in pages:
    print(f"Current page: {result['serpapi_pagination']['current']}\n")

    for organic_result in result["organic_results"]:
        print(
            f"Title: {organic_result['title']}\nLink: {organic_result['link']}\n"
        )
Output
Current page: 12
URL: https://fi.google.com/
URL: https://www.mayoclinic.org/about-mayo-clinic
...
Current page: 18
URL: https://igem.org/About
URL: https://www.ieee.org/
URL: https://www.cancer.org/
...
Disclaimer: I work at SerpApi.
This one works well at the moment. For any search, the scraper keeps grabbing titles and their links, traversing all the next pages until there is no next page left or your IP address is banned. Make sure your bs4 version is >= 4.7.0, as I've used a pseudo CSS selector within the script.
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
base = "https://www.google.de"
link = "https://www.google.de/search?q={}"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

def grab_content(link):
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "lxml")
    for container in soup.select("[class='g'] a[href^='http'][data-ved]:has(h3)"):
        post_title = container.select_one("h3").get_text(strip=True)
        post_link = container.get('href')
        yield post_title, post_link

    next_page = soup.select_one("a[href][id='pnnext']")
    if next_page:
        next_page_link = urljoin(base, next_page.get("href"))
        yield from grab_content(next_page_link)

if __name__ == '__main__':
    search_keyword = "python"
    qualified_link = link.format(search_keyword.replace(" ", "+"))
    for item in grab_content(qualified_link):
        print(item)
This can be done using the google and beautifulsoup4 modules; install them from the command line using the command given below:
pip install google beautifulsoup4
Thereafter, run the simplified code given below:
import webbrowser, googlesearch as gs

def direct(txt):
    print(f"sure, searching '{txt}'...")
    results = gs.search(txt, num=1, stop=1, pause=0)
    # num, stop denote the number of search results you want
    for link in results:
        print(link)
        webbrowser.open_new_tab(link)  # to open the results in the browser

direct('cheap thrills on Youtube')  # this will play the song on YouTube
# (for this, keep num=1, stop=1)
Output:
TIP: Using this, you can also make a small virtual assistant that opens the top search result in the browser for your given natural-language query (txt).
Feel free to comment in case of difficulty while running this code:)
from urllib.request import urlopen, Request
from bs4 import BeautifulSoup
import re

query = input("query>>")
query = "+".join(query.strip().split())
html = "https://www.google.co.in/search?site=&source=hp&q=" + query + "&gws_rd=ssl"
req = Request(html, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urlopen(req).read(), "html.parser")

# Regex
reg = re.compile(".*&sa=")
links = []

# Parsing web urls
for item in soup.find_all('h3', attrs={'class': 'r'}):
    line = (reg.match(item.a['href'][7:]).group())
    links.append(line[:-4])

print(links)
This should be handy. For more, go to https://github.com/goyal15rajat/Crawl-google-search.git
Here is a Python script using requests and BeautifulSoup to scrape Google results.
import requests
from bs4 import BeautifulSoup

# desktop user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
# mobile user-agent
MOBILE_USER_AGENT = "Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36"

query = "hackernoon How To Scrape Google With Python"
query = query.replace(' ', '+')
URL = f"https://google.com/search?q={query}"

headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)

results = []
if resp.status_code == 200:
    soup = BeautifulSoup(resp.content, "html.parser")
    for g in soup.find_all('div', class_='r'):
        anchors = g.find_all('a')
        if anchors:
            link = anchors[0]['href']
            title = g.find('h3').text
            item = {
                "title": title,
                "link": link
            }
            results.append(item)

print(results)