Google search url given by href is wrong


When a Google results page is parsed with BeautifulSoup, the result links come back in the following form:
/url?q= "URL WOULD BE HERE" &sa=U&ei=9LFsUbPhN47qqAHSkoGoDQ&ved=0CCoQFjAA&usg=AFQjCNEZ_f4a9Lnb8v2_xH0GLQ_-H0fokw
I am getting the links by calling soup.findAll('a') and then reading a['href'].
More specifically, the code I have used is the following:
import urllib2
from BeautifulSoup import BeautifulSoup, SoupStrainer
import re
main_site = 'https://www.google.com/'
search = 'search?q='
query = 'pillows'
full_url = main_site+search+query
request = urllib2.Request(full_url, headers={'User-Agent': 'Chrome/16.0.912.77'})
main_html = urllib2.urlopen(request).read()
results = BeautifulSoup(main_html, parseOnlyThese=SoupStrainer('div', {'id': 'search'}))
try:
    for search_hit in results.findAll('li', {'class':'g'}):
        for elm in search_hit.findAll('h3',{'class':'r'}):
            for a in elm.findAll('a',{'href':re.compile('.+')}):
                print a['href']
except TypeError:
    pass
Also, I have noticed on other sites that a['href'] may return something like /dsoicjsdaoicjsdcj, where the link would take you to website.com/dsoicjsdaoicjsdcj.
I know that in this case I can simply concatenate the two, but I feel I shouldn't have to change how I parse and treat a['href'] depending on which website I'm looking at. Is there a better way to get this link? Is there some JavaScript that I need to take into account? Surely there is a simple way in BeautifulSoup to get the full URL to follow from an a tag?

SoupStrainer('div', {'class': "vsc"})
returns nothing because, when you run
print main_html
and search the output for "vsc", there is no match.
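On the second part of the question (an href such as /dsoicjsdaoicjsdcj), the standard library can resolve a relative link against the page it came from, so no per-site string concatenation should be needed. A minimal sketch in Python 3, assuming page_url is whatever page was fetched (the value here is a placeholder, not from the question):

```python
from urllib.parse import urljoin

# Hypothetical page the link was found on (placeholder value).
page_url = "https://website.com/some/page.html"

# A root-relative href is resolved against the scheme and host of the page.
relative_result = urljoin(page_url, "/dsoicjsdaoicjsdcj")
print(relative_result)

# An href that is already absolute is returned unchanged.
absolute_result = urljoin(page_url, "https://example.org/other")
print(absolute_result)
```

Calling urljoin on every a['href'] gives the same handling for relative and absolute links.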

You're looking for this:
# container with needed data: title, link, etc.
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
Also, with the requests library you can pass URL parameters as a dict:
# this:
main_site = 'https://www.google.com/'
search = 'search?q='
query = 'pillows'
full_url = main_site+search+query
# could be translated to this:
params = {
    'q': 'pillows',
    'gl': 'us',
    'hl': 'en',
}
html = requests.get('https://www.google.com/search', params=params)
With urllib you can do it like so (in Python 3, urlencode lives in urllib.parse and urlopen in urllib.request):
# https://stackoverflow.com/a/54050957/15164646
# https://stackoverflow.com/a/2506425/15164646
import urllib.parse
import urllib.request
url = "https://disc.gsfc.nasa.gov/SSW/#keywords="
params = {'keyword':"(GPM_3IMERGHHE)", 't1':"2019-01-02", 't2':"2019-01-03", 'bboxBbox':"3.52,32.34,16.88,42.89"}
quoted_params = urllib.parse.urlencode(params)
# 'keyword=%28GPM_3IMERGHHE%29&t1=2019-01-02&t2=2019-01-03&bboxBbox=3.52%2C32.34%2C16.88%2C42.89' (key order follows the dict in Python 3.7+)
full_url = url + quoted_params
# 'https://disc.gsfc.nasa.gov/SSW/#keywords=keyword=%28GPM_3IMERGHHE%29&t1=2019-01-02&t2=2019-01-03&bboxBbox=3.52%2C32.34%2C16.88%2C42.89'
resp = urllib.request.urlopen(full_url).read()
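For the Google search URL from the question, the same standard-library call can be checked without any network access. This is only a sketch: the query text and the hl value are made up for illustration.

```python
from urllib.parse import urlencode

base_url = "https://www.google.com/search"
# Illustrative parameters (assumed values, not taken from the question).
params = {"q": "memory foam pillows", "hl": "en"}

# urlencode percent-encodes each value and joins the pairs with '&';
# spaces in values become '+'.
query_string = urlencode(params)
full_url = base_url + "?" + query_string
print(full_url)
```

Building the query string this way avoids hand-escaping spaces and punctuation in the search term.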
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
params = {
    'q': 'minecraft',
    'gl': 'us',
    'hl': 'en',
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
---------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to make everything from scratch, bypass blocks, and maintain the parser over time.
Code to integrate to achieve your goal:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "minecraft",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(result['link'])
---------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''
Disclaimer, I work for SerpApi.

Related

Python: scraping google results for websites' main URL and title

I am trying to scrape a given number of results from a Google search, but so far I have come across two problems. The first is that I don't know how to join the URLs and the titles inside the same loop, so that they can be shown together in the format:
(Title)
(Website URL)
(---------)
(Title)
(Website URL)
(---------)
I somehow managed to achieve this format, but the loop is going on several times, instead of just showing the top 10 results. I believe it's something to do with how I structured the loops to work together, but I don't know how to avoid this.
The other problem is that I want to obtain both main URL and title of each website within search results, but while I managed to get the right titles, I seem to be getting many links coming from the same website, instead of only the main URL. For instance, if I search for "data science", the second or third title shown is from Coursera, while the link is from wikipedia. I only want the main URL so the title matches the website URL, how do I get it?
Any input will be greatly appreciated
import requests
from bs4 import BeautifulSoup
import re
query = "data science"
search = query.replace(' ', '+')
results = 10
url = (f"https://www.google.com/search?q={search}&num={results}")
requests_results = requests.get(url)
soup_link = BeautifulSoup(requests_results.content, "html.parser")
soup_title = BeautifulSoup(requests_results.text,"html.parser")
links = soup_link.find_all("a")
heading_object=soup_title.find_all( 'h3' )
for link in links:
    for info in heading_object:
        get_title = info.getText()
        link_href = link.get('href')
        if "url?q=" in link_href and not "webcache" in link_href:
            print(get_title)
            print(link.get('href').split("?q=")[1].split("&sa=U")[0])
            print("------")

The length of your links doesn't seem to match your heading_object list. I think it's best if you filter it further than just "a".
Editing your solution, you can loop through links like this:
import requests
from bs4 import BeautifulSoup
import re
query = "data science"
search = query.replace(' ', '+')
results = 10
url = (f"https://www.google.com/search?q={search}&num={results}")
requests_results = requests.get(url)
soup_link = BeautifulSoup(requests_results.content, "html.parser")
links = soup_link.find_all("a")
for link in links:
    link_href = link.get('href')
    if "url?q=" in link_href and not "webcache" in link_href:
        title = link.find_all('h3')
        if len(title) > 0:
            print(link.get('href').split("?q=")[1].split("&sa=U")[0])
            print(title[0].getText())
            print("------")
Instead of keeping two lists for headers and links, we can get the header directly from the link by calling find_all('h3') on the link object itself.
Some links match the url?q= format but are not part of the actual results you want to display, such as the expanding accordion for related searches, so we need to filter those out too. We do that by checking that the link contains an h3 header, which is why we test len(title) > 0.

Try passing the requests params as a dict; it's more readable, e.g.:
params = {
    "q": "fus ro dah",
    "hl": "en",
    "gl": "us",
    "num": "100"
}
requests.get('https://www.google.com/search', params=params)
Make sure you pass request headers with a user-agent so the request looks like a real user visit. Otherwise Google will eventually block your request, because the default requests user-agent is python-requests. Check what your user-agent is.
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
You don't need to create several soups (BeautifulSoup() objects); create only one and reuse it wherever it's needed. CSS selectors reference.
soup = BeautifulSoup(html.text, 'YOUR PARSER OF CHOICE') # try to use 'lxml', it's one of the fastest
# call it
soup.select()
soup.findAll()
soup.a.parent
soup.p.next_element
for i in soup.select('css_selector'):
    some_variable = i.select_one('css_selector')
Code and full example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    'q': 'data science',
    'hl': 'en',
    'num': '100'
}
html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')
# container with all needed data
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    displayed_link = result.select_one('.TbwUpd.NJjxre').text
    try:
        snippet = result.select_one('#rso .lyLwlc').text
    except AttributeError:  # select_one() returned None: no snippet for this result
        snippet = None
    print(f'{title}\n{link}\n{displayed_link}\n{snippet}\n')
    print('---------------')
'''
Data Science Specialization - Coursera
https://www.coursera.org/specializations/jhu-data-science
https://www.coursera.org › ... › Data Analysis
Offered by Johns Hopkins University. Launch Your Career in Data Science. A ten-course introduction to data science, developed and taught by .
---------------
'''
Alternatively, you can do the same thing using Google Organic Results API from SerpAPI. It's a paid API with a free plan.
The main difference is that you only need to iterate over structured JSON and pick out the data you want, without figuring out how to select certain elements and extract data from them, how to bypass Google blocks if they appear, or how to deal with JavaScript-heavy websites such as Google Maps.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"), # serpapi API key
    "engine": "google", # search engine
    "q": "data science", # search query
    "hl": "en" # language of the search
}
search = GoogleSearch(params) # where data extraction happens
results = search.get_dict() # JSON -> Python dictionary
for result in results['organic_results']:
    title = result['title']
    link = result['link']
    displayed_link = result['displayed_link']
    snippet = result['snippet']
    print(f"{title}\n{link}\n{displayed_link}\n{snippet}\n")
    print('---------------')
'''
Data science - Wikipedia
https://en.wikipedia.org/wiki/Data_science
https://en.wikipedia.org › wiki › Data_science
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured ...
---------------
'''
Disclaimer, I work for SerpApi.

scraping google search results page data python

I want to scrape emails from the results of a search query, but when I access a class with the CSS selector method select and print the result, it always shows an empty list. How can I access the .r class or class="g"?
import requests
from bs4 import BeautifulSoup
url = "https://www.google.com/search?sxsrf=ACYBGNQA4leQETe0psVZPu7daLWbdsc9Ow%3A1579194494737&ei=fpggXpvRLMakwQKkqpSICg&q=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&oq=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&gs_l=psy-ab.12...0.0..7407...0.0..0.0.0.......0......gws-wiz.82okhpdJLYg&ved=0ahUKEwibiI_3zYjnAhVGUlAKHSQVBaEQ4dUDCAs"
responce = requests.get(url)
soup = BeautifulSoup(responce.text, "html.parser")
test = soup.select('.r')
print(test)

Your program is correct, but to get the correct response from Google you need to specify a User-Agent header:
import requests
from bs4 import BeautifulSoup
url = "https://www.google.com/search?sxsrf=ACYBGNQA4leQETe0psVZPu7daLWbdsc9Ow%3A1579194494737&ei=fpggXpvRLMakwQKkqpSICg&q=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&oq=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&gs_l=psy-ab.12...0.0..7407...0.0..0.0.0.......0......gws-wiz.82okhpdJLYg&ved=0ahUKEwibiI_3zYjnAhVGUlAKHSQVBaEQ4dUDCAs"
headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0'}
responce = requests.get(url, headers=headers) # <-- specify custom header
soup = BeautifulSoup(responce.text, "html.parser")
test = soup.select('.r')
print(test)
Prints:
[<div class="r"><a href="https://www.yahoo.com/news/11-course-complete-computer-science-171322233.html" onmousedown="return rwt(this,'','','','1','AOvVaw2wM4TUxc_4V7s9GjeWTNAG','','2ahUKEwjt17Kk-YjnAhW2R0EAHcnsC3QQFjAAegQIAxAB','','',event)"><div class="TbwUpd"><img alt="https://...
...

To get the emails out of the Google Search results, you need to use a regex:
# this regex may need modifications
re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', variable_where_to_search_from)
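A pattern of this shape can be tried offline before pointing it at real search snippets. The sketch below writes the separator as '@'; the snippet text and both addresses are invented placeholders, not search results.

```python
import re

# Same shape as the pattern above, with '@' between local part and domain.
EMAIL_RE = re.compile(r"[\w.-]+@[\w.-]+\.\w+")

# Invented snippet text; the addresses are placeholders.
snippet = "Contact jane.doe@example.com or j_smith@mail.example.org for details."

emails = EMAIL_RE.findall(snippet)
print(emails)
```

The pattern is deliberately loose, so it may also match strings that are not valid addresses; tighten it if false positives matter.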
Code:
from bs4 import BeautifulSoup
import requests, lxml, re
headers = {
    "User-agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
# pass the query through params so the quotes and '@' are URL-encoded properly
params = {'q': '"computer science ""usa" "@yahoo.com"'}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
    try:
        snippet = result.select_one('.lyLwlc').text
    except AttributeError:  # select_one() returned None: no snippet for this result
        snippet = None
    match_email = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', str(snippet))
    email = '\n'.join(match_email).strip()
    print(email)
----------
'''
ahmed_733#yahoo.com
yjzou#uguam.uog
yzou2002#yahoo.com
...
'''
Alternatively, you can do the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
It doesn't extract emails using regex, although that would be a great possible feature. The main difference is that it is much easier and faster to get things done than creating everything from scratch.
Code to integrate:
from serpapi import GoogleSearch
import re
params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": '"computer science ""usa" "@yahoo.com"',
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    try:
        snippet = result['snippet']
    except KeyError:  # this result has no snippet
        snippet = None
    match_email = re.findall(r'[\w\.-]+@[\w\.-]+\.\w+', str(snippet))
    email = '\n'.join(match_email).strip()
    print(email)
---------
'''
shaikotweb#yahoo.com
ahmed_733#yahoo.com
RPeterson#L1id.com
rj_peterson#yahoo.com
'''
Disclaimer, I work for SerpApi.

Exact website links from google through BeautifulSoup

I want to search Google using BeautifulSoup and open the first link, but when I open the link it shows an error. I think the reason is that Google does not provide the exact link of the website; it adds several parameters to the URL. How can I get the exact URL?
When I tried to use the cite tag it worked, but for long URLs it causes problems.
The first link, which I get using soup.h3.a['href'][7:], is:
'http://www.wikipedia.com/wiki/White_holes&sa=U&ved=0ahUKEwi_oYLLm_rUAhWJNI8KHa5SClsQFggbMAI&usg=AFQjCNGN-vlBvbJ9OPrnq40d0_b8M0KFJQ'
Here is my code:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.google.com/search?q=site:wikipedia.com+Black+hole&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw')
soup = BeautifulSoup(r.text, "html.parser")
print(soup.h3.a['href'][7:])

You could split the returned string:
url = soup.h3.a['href'][7:].split('&')
print(url[0])
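The slice-and-split approach can be checked offline on a string shaped like the one in the question. This is a minimal sketch: the tracking parameters are shortened placeholders, and the unquote step is an addition for the case where the wrapped URL contains percent-encoded characters.

```python
from urllib.parse import unquote

# href shaped like the one in the question; "/url?q=" is 7 characters long.
# The tracking values are shortened placeholders.
href = "/url?q=http://www.wikipedia.com/wiki/White_holes&sa=U&ved=0ahUKEwi&usg=AFQjCN"

# Drop the "/url?q=" prefix and keep everything before the first '&'.
target = href[7:].split("&")[0]

# Decode any percent-encoded characters in the wrapped URL.
clean_url = unquote(target)
print(clean_url)
```

If the prefix ever differs from "/url?q=", the fixed slice of 7 characters will cut in the wrong place, so parsing the query string with urllib.parse.parse_qs is likely the more robust choice.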

Combining the answers above, your code will look like this:
from bs4 import BeautifulSoup
import requests
import csv
url = "https://www.google.co.in/search?q=site:wikipedia.com+Black+hole&dcr=0&gbv=2&sei=Nr3rWfLXMIuGvQT9xZOgCA"
r = requests.get(url)
data = r.text
soup = BeautifulSoup(data, "html.parser")
get_details = soup.find_all("div", attrs={"class":"g"})
final_data = []
for details in get_details:
    link = details.find_all("h3")
    for mdetails in link:
        links = mdetails.find_all("a")
        for lnk in links:
            # drop the leading "/url?q=" and keep the part before the first "&"
            lmk = lnk.get("href")[7:].split("&")
            final_data.append([lmk[0]])
filename = "Google.csv"
with open(filename, "w", newline="") as csvfile:
    writer = csv.writer(csvfile, delimiter=",")
    # one extracted URL per row
    for row in final_data:
        writer.writerow(row)

It's much simpler. You're looking for this:
# instead of this:
soup.h3.a['href'][7:].split('&')
# use this:
soup.select_one('.yuRUbf a')['href']
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "site:wikipedia.com black hole", # query
    "gl": "us", # country to search from
    "hl": "en" # language
}
html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
first_link = soup.select_one('.yuRUbf a')['href']
print(first_link)
# https://en.wikipedia.com/wiki/Primordial_black_hole
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to extract the data from the structured JSON, rather than figuring out why things don't work and then maintaining the parser over time as selectors change.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "site:wikipedia.com black hole",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
# [0] - first index of search results
first_link = results['organic_results'][0]['link']
print(first_link)
# https://en.wikipedia.com/wiki/Primordial_black_hole
Disclaimer, I work for SerpApi.

Python: parse links from Google with search

I need to parse the result links after a search in Google.
When I view the page source with Ctrl + U, I can't find the element with the links I want.
But when I inspect the elements with
Ctrl + Shift + I, I can see which element I should parse to get the links.
I use this code:
url = 'https://www.google.ru/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=' + str(query)
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
links = soup.findAll('cite')
But it returns an empty list, because those elements are not there.
I think the HTML returned by requests.get(url).content isn't complete, so I can't get these elements.
I tried to use google.search, but it returned an error saying it is no longer in use.
Is there any way to get links from a Google search?

Try:
url = 'https://www.google.ru/search?q=' + str(query)
html = requests.get(url)
soup = BeautifulSoup(html.text, 'lxml')
links = soup.findAll('cite')
print([link.text for link in links])
For installing lxml, please see http://lxml.de/installation.html
*Note: I chose lxml instead of html.parser because I sometimes got incomplete results with html.parser, and I don't know why.

Use:
url = 'https://www.google.ru/search?q=name&rct=' + str(query)
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
links = soup.findAll('cite')

To get the same response that you see in the browser, you need to send additional headers, most importantly a user-agent (along with any query parameters), so that the request looks like a "real" user visit. Without one, requests announces itself with its default python-requests user-agent.
That's why you were getting an empty result: you received different HTML with different elements (CSS selectors, IDs, and so on).
You can read more about it in the blog post I wrote about how to reduce the chance of being blocked while web scraping.
Pass user-agent:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
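The same idea can be sketched with only the standard library, which also makes it easy to inspect what would be sent before any request goes out. The user-agent string is the Firefox one quoted elsewhere on this page; the query values are illustrative assumptions.

```python
from urllib.parse import urlencode
from urllib.request import Request

headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0"
}
# Illustrative query parameters (assumed values).
url = "https://www.google.com/search?" + urlencode({"q": "minecraft", "hl": "en"})

# Building the Request object does not touch the network.
request = Request(url, headers=headers)

# urllib stores header names in capitalized form, i.e. "User-agent".
sent_user_agent = request.get_header("User-agent")
print(sent_user_agent)
print(request.full_url)

# To actually fetch the page (needs network access and may be blocked):
# from urllib.request import urlopen
# html = urlopen(request).read().decode("utf-8", errors="replace")
```

Printing the prepared headers is a quick way to confirm that the custom user-agent, not the library default, is what the server will see.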
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
params = {
    'q': 'minecraft', # query
    'gl': 'us', # country to search from
    'hl': 'en', # language
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link, sep='\n')
---------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''
Alternatively, you can achieve the same thing by using Google Organic API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to create it from scratch and maintain it over time if something crashes.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
    "engine": "google",
    "q": "minecraft",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
    print(result['link'])
-------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''
Disclaimer, I work for SerpApi.

beautiful soup extract a href from google search

A google search gives me the following first result on HTML:
<h3 class="r"><em>Quantitative Trading</em>: <em>How to Build Your Own Algorithmic</em> <b>...</b> - Amazon</h3>
I would like to extract the link http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889 from this, but when I use Beautiful Soup to extract the information with
soup.find("h3").find("a").get("href")
I obtain the following string instead:
/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889&sa=U&ei=P2ycT6OoNuasiAL2ncV5&ved=0CBIQFjAA&usg=AFQjCNEo_ujANAKnjheWDRlBKnJ1BGeA7A
I know that the link is in there and I could parse it by deleting the /url?q= and everything after the & symbol, but I was wondering if there was a cleaner solution.
Thanks!

You can use a combination of urlparse.urlparse and urlparse.parse_qs, e.g
>>> import urlparse
>>> url = '/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889&sa=U&ei=P2ycT6OoNuasiAL2ncV5&ved=0CBIQFjAA&usg=AFQjCNEo_ujANAKnjheWDRlBKnJ1BGe'
>>> data = urlparse.parse_qs(
... urlparse.urlparse(url).query
... )
>>> data
{'ei': ['P2ycT6OoNuasiAL2ncV5'],
'q': ['http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889'],
'sa': ['U'],
'usg': ['AFQjCNEo_ujANAKnjheWDRlBKnJ1BGe'],
'ved': ['0CBIQFjAA']}
>>> data['q'][0]
'http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889'
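The urlparse module used above is Python 2; in Python 3 the same functions live in urllib.parse. A minimal Python 3 sketch of the same technique follows. The helper name unwrap_google_href is hypothetical, and the check on the /url path is an added assumption so that ordinary links pass through untouched.

```python
from urllib.parse import urlparse, parse_qs

def unwrap_google_href(href):
    """Return the target of a '/url?q=...' href, or href unchanged otherwise."""
    parsed = urlparse(href)
    # parse_qs maps each key to a list of values and percent-decodes them.
    query = parse_qs(parsed.query)
    if parsed.path == "/url" and "q" in query:
        return query["q"][0]
    return href

href = ("/url?q=http://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889"
        "&sa=U&ei=P2ycT6OoNuasiAL2ncV5&ved=0CBIQFjAA&usg=AFQjCNEo_ujANAKnjheWDRlBKnJ1BGe")

wrapped_result = unwrap_google_href(href)
plain_result = unwrap_google_href("https://example.com/page")
print(wrapped_result)
print(plain_result)
```

Because parse_qs handles the splitting and decoding, there is no need to slice off a fixed-length prefix or split on '&' by hand.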

To extract only the first result from the page, you can use the bs4 select_one() method with a CSS selector, or the find() method.
Code and example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
# passing parameters in URLs
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {'q': 'Quantitative Trading How to Build Your Own Algorithmic - amazon'}
def bs4_get_first_googlesearch():
    html = requests.get('https://www.google.com/search', headers=headers, params=params).text
    soup = BeautifulSoup(html, 'lxml')
    first_link = soup.select_one('.yuRUbf').a['href']
    print(first_link)
bs4_get_first_googlesearch()
# output:
'''
https://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889
'''
Alternatively, you can do the same thing using Google Search Engine Results API from SerpApi. It's a paid API with a free trial of 5,000 searches. Check out the playground.
The big difference is that everything is already done for the end-user: selecting elements, bypass blocking, proxy rotation, and more.
Code to integrate:
from serpapi import GoogleSearch
import os
def serpapi_get_first_googlesearch():
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google",
        "q": "Quantitative Trading How to Build Your Own Algorithmic - amazon",
        "hl": "en",
    }
    search = GoogleSearch(params)
    results = search.get_dict()
    # [0] - first element from the search results
    first_link = results['organic_results'][0]['link']
    print(first_link)
serpapi_get_first_googlesearch()
# output:
'''
https://www.amazon.com/Quantitative-Trading-Build-Algorithmic-Business/dp/0470284889
'''
Disclaimer, I work for SerpApi.
