How to grab text data from Google search info bar - python

I need to grab text data from the Google search engine info bar. If someone searches for the keyword "siemens", a small info bar appears on the right side of the Google search results. I want to collect some text information from that info bar. How can I do that using requests and BeautifulSoup? Here is some of the code I have written:
from bs4 import BeautifulSoup as BS
import requests
from googlesearch import search
from googleapiclient.discovery import build
url = 'https://www.google.com/search?ei=j-iKXNDxDMPdwALdwofACg&q='
com = 'siemens'
#for url in search(com, tld='de', lang='de', stop=10):
# print(url)
response = requests.get(url+com)
soup = BS(response.content, 'html.parser')
[Screenshot in the original question: the red-marked area on the right side is the info bar.]

You can use the find function in BeautifulSoup to retrieve elements by a given class name, id, or other attribute (select() handles CSS selectors; note that BeautifulSoup itself does not support XPath). If you inspect the info bar (right-click on it and choose 'Inspect'), you can find the unique class name or id for that bar. Use that to filter the info bar alone out of the entire HTML parsed by BeautifulSoup.
Check out find() and find_all() in BeautifulSoup to achieve your output. Always try finding by id first, since an id is unique within an HTML page. If there is no id for that element, then go for the other options.
To build the URL, use google.com/search?q=[] with your search query inside []. For queries with more than one word, put a '+' between the words.
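As a hedged sketch of those steps (the id and class names below are placeholders; use whatever Inspect shows for the info bar, and note that Google may serve different HTML to scripts, which the next answer addresses):

from bs4 import BeautifulSoup
import requests

query = "siemens ag"  # multi-word query
url = "https://www.google.com/search?q=" + query.replace(" ", "+")

response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")

# Prefer an id if the info bar has one (ids are unique per page)...
info_bar = soup.find(id="info-bar-id")  # placeholder id
# ...otherwise fall back to a class name.
if info_bar is None:
    info_bar = soup.find("div", class_="info-bar-class")  # placeholder class

if info_bar is not None:
    print(info_bar.get_text(" ", strip=True))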

Make sure you're sending a user-agent header to imitate a real user visit; otherwise Google may block the request. List of user-agents.
To visually select elements from a page, you can use the SelectorGadget Chrome extension to grab CSS selectors.
Code and example in online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get('https://www.google.com/search?q=simens', headers=headers).text
soup = BeautifulSoup(response, 'lxml')
title = soup.select_one('.SPZz6b h2').text
subtitle = soup.select_one('.wwUB2c span').text
website = soup.select_one('.ellip .ellip').text
snippet = soup.select_one('.Uo8X3b+ span').text
print(f'{title}\n{subtitle}\n{website}\n{snippet}')
Output:
Siemens
Automation company
siemens.com
Siemens AG is a German multinational conglomerate company headquartered in Munich and the largest industrial manufacturing company in Europe with branch offices abroad.
Alternatively, you can use Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "simens",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
title = results["knowledge_graph"]["title"]
subtitle = results["knowledge_graph"]["type"]
website = results["knowledge_graph"]["website"]
snippet = results["knowledge_graph"]["description"]
print(f'{title}\n{subtitle}\n{website}\n{snippet}')
Output:
Siemens
Automation company
http://www.siemens.com/
Siemens AG is a German multinational conglomerate company headquartered in Munich and the largest industrial manufacturing company in Europe with branch offices abroad.
Disclaimer: I work at SerpApi.

Related

I can't spot some of the elements in the site's source code

I was trying to scrape this website to get the player data.
https://mystics.wnba.com/roster/
I viewed the code using 'Inspect' but the main table isn't in the source code. For example, this is the code for the first player's name:
<div class="content-table__player-name">
<a ng-href="https://www.wnba.com/player/ariel-atkins/" target="_self" href="https://www.wnba.com/player/ariel-atkins/">Ariel Atkins</a>
</div>
I can't find this piece of code (or any code for the player data) in the page source. I searched for most of the table's divs in the source code but I couldn't find any of them.
The content is generated on the fly using some JavaScript. To get the data you want, your program needs to be able to run and interpret JavaScript. You can use tools like Selenium or the headless mode of Chrome to extract the DOM from a running browser; see the sketch below.
In Firefox you can press F12 to inspect the DOM that was generated by the JavaScript code. In there, you can locate the desired entries. You can also inspect the Network tab, which shows you the requests the site sends to the server. You might be able to identify the requests that return your desired results.
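For example, a minimal Selenium sketch along those lines (assuming Chrome and a matching chromedriver are installed; the CSS class comes from the snippet in the question):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--headless")  # no visible browser window
driver = webdriver.Chrome(options=options)
try:
    driver.get("https://mystics.wnba.com/roster/")
    # wait until the JavaScript-rendered player names exist in the DOM
    names = WebDriverWait(driver, 15).until(
        EC.presence_of_all_elements_located(
            (By.CSS_SELECTOR, "div.content-table__player-name a")
        )
    )
    for name in names:
        print(name.text)
finally:
    driver.quit()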
Since the question is tagged with scrapy, here is a solution using Scrapy.
import scrapy
import json

class Test(scrapy.Spider):
    name = 'test'
    start_urls = ['https://data.wnba.com/data/5s/v2015/json/mobile_teams/wnba/2021/teams/mystics_roster.json']

    def parse(self, response):
        data = json.loads(response.body)
        data = data.get('t').get('pl')
        for player in data:
            print(player.get('fn'), player.get('ln'))
The following is how you can access the content using the requests module.
import requests
link = 'https://data.wnba.com/data/5s/v2015/json/mobile_teams/wnba/2021/teams/mystics_roster.json'
with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    res = s.get(link)
    for item in res.json()['t']['pl']:
        print(item['fn'], item['ln'])
Output:
Leilani Mitchell
Shavonte Zellous
Tina Charles
Elena Delle Donne
Theresa Plaisance
Natasha Cloud
Shatori Walker-Kimbrough
Sydney Wiese
Erica McCall
Ariel Atkins
Myisha Hines-Allen
Megan Gustafson

Python html parsing partial class names

I am trying to parse a webpage with bs4 but the elements I am trying to access all have different class names.
Example: class='list-item listing … id-12984' and class='list-item listing … id-10359'
def preownedaston(url):
    preownedaston_resp = requests.get(url)
    if preownedaston_resp.status_code == 200:
        bs = BeautifulSoup(preownedaston_resp.text, 'lxml')
        posts = bs.find_all('div', class_='')  # don't know what to put here
        for p in posts:
            title_year = p.find('div', class_='inset').find('a').find('span', class_='model_year').text
            print(title_year)

preownedaston('https://preowned.astonmartin.com/preowned-cars/search/?finance%5B%5D=price&price-currency%5B%5D=EUR&custom-model%5B404%5D%5B%5D=809&continent-country%5B%5D=France&postcode-area=United%20Kingdom&distance%5B%5D=0&transmission%5B%5D=Manual&budget-program%5B%5D=pay&section%5B%5D=109&order=-usd_price&pageId=3760')
Is there a way to parse a partial class name like class_='list-item '?
A CSS selector for matching a partial value of an attribute looks like this:
div[class*='list-item']  # the * means: match any element whose class attribute contains this substring
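Here is a minimal, self-contained sketch of that selector in BeautifulSoup (the inline HTML imitates the class names from the question):

from bs4 import BeautifulSoup

html = """
<div class="list-item listing id-12984">Car A</div>
<div class="list-item listing id-10359">Car B</div>
<div class="something-else">not a listing</div>
"""
soup = BeautifulSoup(html, "html.parser")

# select() accepts CSS selectors, so the partial-attribute match works as-is
for div in soup.select("div[class*='list-item']"):
    print(div.get_text(strip=True))  # -> Car A, Car B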
But if you look at the source code of the page, you will see that the content you are trying to scrape is generated by JavaScript. So you have three options here:
1. Use Selenium with a headless browser to render the JavaScript.
2. Look for the Ajax calls and try to simulate them; for example, this URL is the Ajax call the website uses to retrieve the data: Ajax URL.
3. Look for the data you are trying to scrape in a script tag, as follows. I prefer this option in similar situations because you end up parsing JSON.
import requests, json
from bs4 import BeautifulSoup

URL = 'https://preowned.astonmartin.com/preowned-cars/search/?finance%5B%5D=price&price-currency%5B%5D=EUR&custom-model%5B404%5D%5B%5D=809&continent-country%5B%5D=France&postcode-area=United%20Kingdom&distance%5B%5D=0&transmission%5B%5D=Manual&budget-program%5B%5D=pay&section%5B%5D=109&order=-usd_price&pageId=3760'
page = requests.get(URL, headers={"User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36"})
soup = BeautifulSoup(page.text, 'html.parser')
json_obj = soup.find('script', {'type': "application/ld+json"}).text
#{"@context":"http://schema.org","@graph":[{"@type":"Brand","name":""},{"@type":"OfferCatalog","itemListElement":[{"@type":"Offer","name":"Pre-Owned By Aston Martin","price":"€114,900.00","url":"https://preowned.astonmartin.com/preowned-cars/12984-aston-martin-v12-vantage-v8-volante/","itemOffered":{"@type":"Car","name":"Aston Martin V12 Vantage V8 Volante","brand":"Aston Martin","model":"V12 Vantage","itemCondition":"Used","category":"Used","productionDate":"2010","releaseDate":"2011","bodyType":"6.0 Litre V12","emissionsCO2":"388","fuelType":"Obsidian Black","mileageFromOdometer":"42000","modelDate":"2011","seatingCapacity":"2","speed":"190","vehicleEngine":"6l","vehicleInteriorColor":"Obsidian Black","color":"Black"}},{"@type":"Offer","name":"Pre-Owned By Aston Martin","price":"€99,900.00","url":"https://preowned.astonmartin.com/preowned-cars/10359-aston-martin-v12-vantage-carbon-edition-coupe/","itemOffered":{"@type":"Car","name":"Aston Martin V12 Vantage Carbon Edition Coupe","brand":"Aston Martin","model":"V12 Vantage","itemCondition":"Used","category":"Used","productionDate":"2011","releaseDate":"2011","bodyType":"6.0 Litre V12","emissionsCO2":"388","fuelType":"Obsidian Black","mileageFromOdometer":"42000","modelDate":"2011","seatingCapacity":"2","speed":"190","vehicleEngine":"6l","vehicleInteriorColor":"Obsidian Black","color":"Black"}}]},{"@type":"BreadcrumbList","itemListElement":[{"@type":"ListItem","position":"1","item":{"@id":"https://preowned.astonmartin.com/","name":"Homepage"}},{"@type":"ListItem","position":"2","item":{"@id":"https://preowned.astonmartin.com/preowned-cars/","name":"Pre-Owned Cars"}},{"@type":"ListItem","position":"3","item":{"@id":"//preowned.astonmartin.com/preowned-cars/search/","name":"Pre-Owned By Aston Martin"}}]}]}
items = json.loads(json_obj)['@graph'][1]['itemListElement']
for item in items:
    print(item['itemOffered']['name'])
Output:
Aston Martin V12 Vantage V8 Volante
Aston Martin V12 Vantage Carbon Edition Coupe
The information from this URL actually comes back in JSON format which means you can easily extract the fields you want. For example:
import requests
url = "https://preowned.astonmartin.com/ajax/stock-listing/get-items/pageId/3760/ratio/3_2/taxBandImageLink/aHR0cHM6Ly9kMnBwMTFwZ29wNWY2cC5jbG91ZGZyb250Lm5ldC9UYXhCYW5kLSV0YXhfYmFuZCUuanBn/taxBandImageHyperlink/JWRlYWxlcl9lbWFpbCU=/imgWidth/767/?finance%5B%5D=price&price-currency%5B%5D=EUR&custom-model%5B404%5D%5B%5D=809&continent-country%5B%5D=France&distance%5B%5D=0&transmission%5B%5D=Manual&budget-program%5B%5D=pay&section%5B%5D=109&order=-usd_price&pageId=3760"
r = requests.get(url)
data = r.json()
details = ['make', 'mileage', 'model', 'model_year', 'mpg', 'exterior_colour', 'price_now']
for vehicle in data['vehicles']:
    print()
    for key in details:
        print(f"{key:18} : {vehicle[key]}")
This displays the following:
make : Aston Martin
mileage : 42,000 km
model : V12 Vantage
model_year : 2011
mpg : 17.3
exterior_colour : Carbon Black
price_now : €114,900
make : Aston Martin
mileage : 42,000 km
model : V12 Vantage
model_year : 2011
mpg : 17.3
exterior_colour : Carbon Black
price_now : €99,900
Note: it might be necessary to add a user-agent request header if the data is not returned (see the sketch below). If you print data, you can see all of the available information for each vehicle.
This approach avoids the need for JavaScript processing via Selenium and also avoids parsing any HTML with BeautifulSoup. The URL was found using the browser's network tools while the page was loading.
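For instance, the user-agent note above amounts to one extra header on the same request (the UA string here is just an arbitrary desktop browser string):

import requests

ajax_url = "..."  # paste the Ajax URL from the snippet above
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0 Safari/537.36"
}

r = requests.get(ajax_url, headers=headers)
r.raise_for_status()      # fail loudly if the request was rejected
data = r.json()
print(list(data.keys()))  # inspect the top-level fields, e.g. 'vehicles'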

Identifying issue in retrieving href from Google Scholar

Having trouble scraping links and article names from google scholar. I'm unsure if the issue is with my code or the xpath that I'm using to retrieve the data – or possibly both?
I've already spent the past few hours trying to debug and consulting other Stack Overflow questions, but with no success.
import scrapy
from scrapyproj.items import ScrapyProjItem

class scholarScrape(scrapy.Spider):
    name = "scholarScraper"
    allowed_domains = "scholar.google.com"
    start_urls = ["https://scholar.google.com/scholar?hl=en&oe=ASCII&as_sdt=0%2C44&q=rare+disease+discovery&btnG="]

    def parse(self, response):
        item = ScrapyProjItem()
        item['hyperlink'] = item.xpath("//h3[class=gs_rt]/a/@href").extract()
        item['name'] = item.xpath("//div[@class='gs_rt']/h3").extract()
        yield item
The error messages I have been receiving say: "AttributeError: xpath" so I believe that the issue lies with the path that I'm using to try and retrieve the data, but I could also be mistaken?
Adding my comment as an answer, as it solved the problem:
The issue is with scrapyproj.items.ScrapyProjItem objects: they do not have an xpath attribute. Is this an official scrapy class? I think you meant to call xpath on response:
item['hyperlink'] = response.xpath("//h3[class=gs_rt]/a/@href").extract()
item['name'] = response.xpath("//div[@class='gs_rt']/h3").extract()
Also, the first path expression needs quotes around the attribute value "gs_rt" and an @ before class:
item['hyperlink'] = response.xpath("//h3[@class='gs_rt']/a/@href").extract()
Apart from that, the XPath expressions are fine.
Alternative solution using bs4:
from bs4 import BeautifulSoup
import requests, lxml, os
headers = {
    'User-agent': "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://scholar.google.com/citations?hl=en&user=m8dFEawAAAAJ', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
# Container where all articles are located
for article_info in soup.select('#gsc_a_b .gsc_a_t'):
    # title CSS selector
    title = article_info.select_one('.gsc_a_at').text
    # Same title CSS selector, except we're grabbing the "data-href" attribute.
    # Note: it's a relative link, so join it with the absolute link after extracting.
    title_link = article_info.select_one('.gsc_a_at')['data-href']
    print(f'Title: {title}\nTitle link: https://scholar.google.com{title_link}\n')
# Part of the output:
'''
Title: Automating Gödel's Ontological Proof of God's Existence with Higher-order Automated Theorem Provers.
Title link: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=m8dFEawAAAAJ&citation_for_view=m8dFEawAAAAJ:-f6ydRqryjwC
'''
Alternatively, you can do the same with Google Scholar Author Articles API from SerpApi.
The main difference is that you don't have to think about finding good proxies or trying to solve a CAPTCHA, even if you're using selenium. It's a paid API with a free plan.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar_author",
    "author_id": "9PepYk8AAAAJ",
}
search = GoogleSearch(params)
results = search.get_dict()
for article in results['articles']:
    article_title = article['title']
    article_link = article['link']
    print(f'Title: {article_title}\nLink: {article_link}\n')
# Part of the output:
'''
Title: p-GaN gate HEMTs with tungsten gate metal for high threshold voltage and low gate current
Link: https://scholar.google.com/citations?view_op=view_citation&hl=en&user=9PepYk8AAAAJ&citation_for_view=9PepYk8AAAAJ:bUkhZ_yRbTwC
'''
Disclaimer: I work for SerpApi.

Scrape Google with Python - What is the correct URL for requests.get?

Goal: I would like to verify whether a specific Google search has a suggested result on the right-hand side and, in case of such a suggestion, scrape some information like company type / address / etc.
Approach: I wanted to use a Python scraper with Requests and BeautifulSoup4
import bs4
import requests
address='https://www.google.co.ve/?gws_rd=cr&ei=DgBqVpWJMoPA-gHy25fACg#q=caracas+arepa'
page = requests.get(address)
soup = bs4.BeautifulSoup(page.content,'html.parser')
print (soup.prettify())
Issue:
The requested page does not include the search results (I am not sure if some variable on the Google page is set to invisible?); it contains only the header and footer of the Google page.
Questions:
Alternative ways to obtain the described information? Any ideas?
I once obtained results with the described method, but the respective address was constructed differently (I remember many numbers in the Google URL, but unfortunately cannot reproduce the search address). Therefore: is there a requirement on the Google URL so that it can be scraped via requests.get?
The best way to get information from a service like Google Places will almost always be the official API. That said, if you're dead set on scraping, it's likely that what's returned by the HTTP request is meant for a browser to render. What BeautifulSoup does is not equivalent to rendering the data it receives, so it's very likely you're just getting useless empty containers that are later filled out dynamically.
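As an aside on the API route, a minimal sketch using the Places text-search endpoint might look like this (hedged: you need your own API key, and parameter names should be double-checked against the current Places API documentation):

import requests

API_KEY = "YOUR_GOOGLE_API_KEY"  # from the Google Cloud console

params = {
    "query": "caracas arepa",  # the search from the question
    "key": API_KEY,
}
resp = requests.get(
    "https://maps.googleapis.com/maps/api/place/textsearch/json",
    params=params,
)
for place in resp.json().get("results", []):
    # each result carries name, formatted_address, types, etc.
    print(place["name"], "-", place.get("formatted_address"))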
I think your question is similar to google-search-with-python-requests; maybe you could get some help from that.
And I agree with LiterallyElvis: the API is a better idea than crawling it directly.
Finally, if you want to use requests for this work, I recommend using PhantomJS and Selenium to imitate a real browser, as Google likely uses AJAX techniques that give a real browser and a crawler different views of the page.
Since I'm in a country where Google is difficult to reach, I couldn't reproduce your problem directly; the above is what I could think of. I hope it helps.
You need to select_one() the element (container) that contains all the needed data, check whether it exists, and if so, scrape the data.
Make sure you're using a user-agent to act like a "real" user visit; otherwise your request might be blocked, or you might receive different HTML with different selectors. Check what your user-agent is.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
params = {
    "q": "caracas arepa bar google",
    "gl": "us"
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36"
}
html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
# if right side knowledge graph is present -> parse the data.
if soup.select_one(".liYKde"):
place_name = soup.select_one(".PZPZlf.q8U8x span").text
place_type = soup.select_one(".YhemCb+ .YhemCb").text
place_reviews = soup.select_one(".hqzQac span").text
place_rating = soup.select_one(".Aq14fc").text
print(place_name, place_type, place_reviews, place_rating, sep="\n")
# output:
'''
Caracas Arepa Bar
Venezuelan restaurant
1,123 Google reviews
4.5
'''
Alternatively, you can achieve the same thing using Google Knowledge Graph API from SerpApi. It's a paid API with a free plan.
The biggest difference is that you don't need to figure out how to parse the data, how to increase the number of requests, or how to bypass blocks from Google and other search engines.
from serpapi import GoogleSearch
import json  # needed for json.dumps below

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "caracas arepa bar place",
    "hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
print(json.dumps([results["knowledge_graph"]], indent=2))
# part of the output:
'''
[
  {
    "title": "Caracas Arepa Bar",
    "type": "Venezuelan restaurant",
    "place_id": "ChIJVcQ2ll9ZwokRwmkvsArPXyo",
    "website": "http://caracasarepabar.com/",
    "description": "Arepa specialist offering creative, low-priced renditions of the Venezuelan corn-flour staple.",
    "local_map": {
      "image": "https://www.google.com/maps/vt/data=TF2Rd51PtEnU2M3pkZHYHKdSwhMDJ_ZwRfg0vfwlDRAmv1u919sgFl8hs_lo832ziTWxCZM9BKECs6Af-TA1hh0NLjuYAzOLFA1-RBEmj-8poygymcRX2KLNVTGGZZKDerZrKW6fnkONAM4Ui-BVN8XwFrwigoqqxObPg8bqFIgeM3LPCg",
      "link": "https://www.google.com/maps/place/Caracas+Arepa+Bar/@40.7131972,-73.9574167,15z/data=!4m2!3m1!1s0x0:0x2a5fcf0ab02f69c2?sa=X&hl=en",
      "gps_coordinates": {
        "latitude": 40.7131972,
        "longitude": -73.9574167,
        "altitude": 15
      }
    } ... much more results including place images, popular times, user reviews.
  }
]
'''
Disclaimer: I work for SerpApi.

Scraping and parsing Google search results using Python

I asked a question on realizing a general idea to crawl and save webpages.
Part of the original question is: how to crawl and save a lot of "About" pages from the Internet.
With some further research, I found some options to go ahead with, for both scraping and parsing (listed at the bottom).
Today, I ran into another Ruby discussion about how to scrape Google search results. This provides a great alternative for my problem, which will save all the effort on the crawling part.
The new question is: in Python, how to scrape Google search results for a given keyword, in this case "About", and finally get the links for further parsing.
What are the best choices of methods and libraries to go ahead with? (in terms of easy-to-learn and easy-to-implement)
P.S. On this website, exactly the same thing is implemented, but it is closed-source and asks for money for more results. I'd prefer to do it myself if nothing open-source is available, and learn more Python in the meanwhile.
Oh, by the way, advice for parsing the links from search results would be nice, if any. Still, easy-to-learn and easy-to-implement. Just started learning Python. :P
Final update, problem solved. The code below uses xgoogle; please read the note in the section below in order to make xgoogle work.
import time, random
from xgoogle.search import GoogleSearch, SearchError

f = open('a.txt', 'wb')
for i in range(0, 2):
    wt = random.uniform(2, 5)
    gs = GoogleSearch("about")
    gs.results_per_page = 10
    gs.page = i
    results = gs.get_results()
    # Try not to annoy Google, with a random short wait
    time.sleep(wt)
    print 'This is the %dth iteration and waited %f seconds' % (i, wt)
    for res in results:
        f.write(res.url.encode("utf8"))
        f.write("\n")

print "Done"
f.close()
Note on xgoogle (answered below by Mike Pennington):
The latest version from its GitHub no longer works by default, probably due to changes in Google's search results. These two replies (a, b) on the home page of the tool give a solution, and it is currently still working with this tweak. But it may stop working again some day due to Google's changes/blocking.
Resources known so far:
For scraping, Scrapy seems to be a popular choice, and a webapp called ScraperWiki is very interesting; there is another project that extracts its library for offline/local usage. Mechanize was brought up quite a few times in different discussions too.
For parsing HTML, BeautifulSoup seems to be one of the most popular choices. Of course, lxml too.
You may find xgoogle useful... much of what you seem to be asking for is there...
There is the twill lib for emulating a browser. I used it when I needed to log in with a Google email account. While it's a great tool with a great idea, it's pretty old and seems to lack support nowadays (the latest version was released in 2007).
It might be useful if you want to retrieve results that require cookie-handling or authentication. twill is likely one of the best choices for those purposes.
BTW, it's based on mechanize.
As for parsing, you are right: BeautifulSoup and Scrapy are great. One of the cool things about BeautifulSoup is that it can handle invalid HTML (unlike Genshi, for example).
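To illustrate that last point, a tiny sketch of BeautifulSoup digesting invalid HTML (unclosed tags) without complaint:

from bs4 import BeautifulSoup

# <b> and <i> are never explicitly closed
broken = "<p>some <b>bold text and <i>italic</p>"
soup = BeautifulSoup(broken, "html.parser")

# the tree is repaired, so the tags can still be navigated normally
print(soup.find("b").get_text())  # -> bold text and italic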
Have a look at this awesome urllib wrapper for web scraping https://github.com/mattseh/python-web/blob/master/web.py
Another option to scrape Google search results using Python is the one by ZenSERP.
I like the API-first approach, which is easy to use, and the JSON results are easily integrated into our solution.
Here is an example of a curl request:
curl "https://app.zenserp.com/api/search" -F "q=Pied Piper" -F "location=United States" -F "search_engine=google.com" -F "language=English" -H "apikey: APIKEY"
And the response:
{
    "q": "Pied Piper",
    "domain": "google.com",
    "location": "United States",
    "language": "English",
    "url": "https://www.google.com/search?q=Pied%20Piper&num=100&hl=en&gl=US&gws_rd=cr&ie=UTF-8&oe=UTF-8&uule=w+CAIQIFISCQs2MuSEtepUEUK33kOSuTsc",
    "total_results": 17100000,
    "auto_correct": "",
    "auto_correct_type": "",
    "results": []
}
Example code in Python:
import requests
headers = {
    'apikey': 'APIKEY',
}
params = (
    ('q', 'Pied Piper'),
    ('location', 'United States'),
    ('search_engine', 'google.com'),
    ('language', 'English'),
)
response = requests.get('https://app.zenserp.com/api/search', headers=headers, params=params)
To extract links from multiple pages of Google Search results you can use SerpApi. It's a paid API with a free trial.
Full example
import os
# Python package: https://pypi.org/project/google-search-results
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "about",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
pages = search.pagination()

for result in pages:
    print(f"Current page: {result['serpapi_pagination']['current']}\n")
    for organic_result in result["organic_results"]:
        print(f"Title: {organic_result['title']}\nLink: {organic_result['link']}\n")
Output
Current page: 12
URL: https://fi.google.com/
URL: https://www.mayoclinic.org/about-mayo-clinic
...
Current page: 18
URL: https://igem.org/About
URL: https://www.ieee.org/
URL: https://www.cancer.org/
...
Disclaimer: I work at SerpApi.
This one works well at the moment. Whenever a search is made, the scraper keeps grabbing titles and their links, traversing all the next pages until no next page is left or your IP address is banned. Make sure your bs4 version is >= 4.7.0, as I've used a pseudo CSS selector (:has()) within the script.
from bs4 import BeautifulSoup
from urllib.parse import urljoin
import requests
base = "https://www.google.de"
link = "https://www.google.de/search?q={}"
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.150 Safari/537.36'
}

def grab_content(link):
    res = requests.get(link, headers=headers)
    soup = BeautifulSoup(res.text, "lxml")
    for container in soup.select("[class='g'] a[href^='http'][data-ved]:has(h3)"):
        post_title = container.select_one("h3").get_text(strip=True)
        post_link = container.get('href')
        yield post_title, post_link

    next_page = soup.select_one("a[href][id='pnnext']")
    if next_page:
        next_page_link = urljoin(base, next_page.get("href"))
        yield from grab_content(next_page_link)

if __name__ == '__main__':
    search_keyword = "python"
    qualified_link = link.format(search_keyword.replace(" ", "+"))
    for item in grab_content(qualified_link):
        print(item)
This can be done using the google and beautifulsoup4 modules; install them from the command line with:
pip install google beautifulsoup4
Then run the simplified code given below.
import webbrowser, googlesearch as gs

def direct(txt):
    print(f"sure, searching '{txt}'...")
    results = gs.search(txt, num=1, stop=1, pause=0)
    # num, stop denote the number of search results you want
    for link in results:
        print(link)
        webbrowser.open_new_tab(link)  # to open the result in the browser

direct('cheap thrills on Youtube')  # this will play the song on YouTube
# (for this, keep num=1, stop=1)
TIP: Using this, you can also make a small virtual assistant that opens the top search result in the browser for a given natural-language query (txt).
Feel free to comment in case of difficulty running this code :)
from urllib.request import urlopen
from bs4 import BeautifulSoup
import urllib.request
import re

query = input("query>>")
query = query.strip().split()
query = "+".join(query)
html = "https://www.google.co.in/search?site=&source=hp&q=" + query + "&gws_rd=ssl"
req = urllib.request.Request(html, headers={'User-Agent': 'Mozilla/5.0'})
soup = BeautifulSoup(urlopen(req).read(), "html.parser")

# Regex to trim the tracking parameters off each result URL
reg = re.compile(".*&sa=")
links = []
# Parsing web urls
for item in soup.find_all('h3', attrs={'class': 'r'}):
    line = reg.match(item.a['href'][7:]).group()
    links.append(line[:-4])

print(links)
This should be handy. For more, go to https://github.com/goyal15rajat/Crawl-google-search.git
Here is a Python script using requests and BeautifulSoup to scrape Google results.
import requests
from bs4 import BeautifulSoup

# desktop user-agent
USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:65.0) Gecko/20100101 Firefox/65.0"
# mobile user-agent (alternative)
MOBILE_USER_AGENT = "Mozilla/5.0 (Linux; Android 7.0; SM-G930V Build/NRD90M) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.125 Mobile Safari/537.36"

query = "hackernoon How To Scrape Google With Python"
query = query.replace(' ', '+')
URL = f"https://google.com/search?q={query}"

headers = {"user-agent": USER_AGENT}
resp = requests.get(URL, headers=headers)

if resp.status_code == 200:
    soup = BeautifulSoup(resp.content, "html.parser")
    results = []
    for g in soup.find_all('div', class_='r'):
        anchors = g.find_all('a')
        if anchors:
            link = anchors[0]['href']
            title = g.find('h3').text
            item = {
                "title": title,
                "link": link
            }
            results.append(item)
    print(results)
