extracting href from <a> beautiful soup - python

I'm trying to extract a link from a google search result. Inspect element tells me that the section I am interested in has "class = r". The first result looks like this:
<h3 class="r" original_target="https://en.wikipedia.org/wiki/chocolate" style="display: inline-block;">
<a href="https://en.wikipedia.org/wiki/Chocolate"
ping="/url?sa=t&source=web&rct=j&url=https://en.wikipedia.org/wiki/Chocolate&ved=0ahUKEwjW6tTC8LXZAhXDjpQKHSXSClIQFgheMAM"
saprocessedanchor="true">
Chocolate - Wikipedia
</a>
</h3>
To extract the "href" I do:
import bs4, requests
res = requests.get('https://www.google.com/search?q=chocolate')
googleSoup = bs4.BeautifulSoup(res.text, "html.parser")
elements = googleSoup.select(".r a")
elements[0].get("href")
But I unexpectedly get:
'/url?q=https://en.wikipedia.org/wiki/Chocolate&sa=U&ved=0ahUKEwjHjrmc_7XZAhUME5QKHSOCAW8QFggWMAA&usg=AOvVaw03f1l4EU9fYd'
Where I wanted:
"https://en.wikipedia.org/wiki/Chocolate"
The attribute "ping" seems to be confusing it. Any ideas?

What's happening?
If you print the response content (i.e. googleSoup.text), you'll see that you're getting completely different HTML: the page source and the response content don't match.
This is not because the content is loaded dynamically; even in that case, the page source and the response content would be the same (it's the HTML you see while inspecting an element that differs).
The basic explanation is that Google recognizes the Python script and changes its response.
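You can see what Google sees by printing the user-agent string that requests sends by default (a quick sketch; the version number depends on your install):
import requests

# requests announces itself as python-requests/<version> by default,
# which is the string Google keys off to serve the altered response.
print(requests.utils.default_user_agent())  # e.g. python-requests/2.31.0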
Solution:
You can pass a fake User-Agent to make the script look like a real browser request.
Code:
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36'}
r = requests.get('https://www.google.co.in/search?q=chocolate', headers=headers)

soup = BeautifulSoup(r.text, 'lxml')
elements = soup.select('.r a')
print(elements[0]['href'])
Output:
https://en.wikipedia.org/wiki/Chocolate
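As a side note, even the blocked response isn't useless: the /url?q=... wrapper quoted in the question carries the real target in its q query parameter, so you can unwrap it with the standard library (a sketch, assuming Google keeps this redirect shape):
from urllib.parse import urlparse, parse_qs

# The redirect URL from the question; the real target sits in the `q` parameter.
redirect = '/url?q=https://en.wikipedia.org/wiki/Chocolate&sa=U&ved=0ahUKEwjHjrmc_7XZAhUME5QKHSOCAW8QFggWMAA&usg=AOvVaw03f1l4EU9fYd'
print(parse_qs(urlparse(redirect).query)['q'][0])
# https://en.wikipedia.org/wiki/Chocolate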
Resources:
Sending “User-agent” using Requests library in Python
How to use Python requests to fake a browser visit?
Using headers with the Python requests library's get method

As the other answer mentioned, it's because no user-agent was specified. The default requests user-agent is python-requests, so Google blocks the request because it knows it's a bot and not a "real" user visit.
A user-agent fakes a real user visit by adding this information to the HTTP request headers. It can be done by passing custom headers (check what your user-agent is):
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get("YOUR_URL", headers=headers)
Additionally, to get more accurate results you can pass URL parameters:
params = {
    "q": "samurai cop, what does katana mean",  # query
    "gl": "in",  # country to search from
    "hl": "en"   # language
    # other parameters
}

requests.get("YOUR_URL", params=params)
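requests URL-encodes these parameters for you; if you want to see the final URL before sending anything, you can build a prepared request (a quick sanity-check sketch):
from requests import Request

# Prepare the request without sending it, just to inspect the encoded URL.
url = Request('GET', 'https://www.google.com/search',
              params={"q": "samurai cop, what does katana mean", "gl": "in", "hl": "en"}).prepare().url
print(url)
# https://www.google.com/search?q=samurai+cop%2C+what+does+katana+mean&gl=in&hl=en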
Code and full example in the online IDE (code from another answer will throw an error because of CSS selector change):
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "samurai cop what does katana mean",
    "gl": "in",
    "hl": "en"
}

html = requests.get("https://www.google.com/search", headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    print(f'{title}\n{link}\n')
-------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
Samurai Cop (1991) - Mathew Karedas as Joe Marshall - IMDb
https://www.imdb.com/title/tt0130236/characters/nm0360481
...
'''
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to iterate over structured JSON and quickly get the data you want, rather than figuring out why certain things don't work as they should and then maintaining the parser over time.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "samurai cop what does katana mean",
    "hl": "en",
    "gl": "in",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['title'])
    print(result['link'])
    print()
------
'''
Samurai Cop - He speaks fluent Japanese - YouTube
https://www.youtube.com/watch?v=paTW3wOyIYw
Samurai Cop - What does "katana" mean? - Quotes.net
https://www.quotes.net/mquote/1060647
...
'''
Disclaimer, I work for SerpApi.

Related

Problem with webscraping google python beautiful soup

I am writing this code, and I want to open some of the subpages it finds.
import bs4
import requests
url = 'https://www.google.com/search?q=python'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
list_sites = soup.select('a[href]')
print(len(list_sites))
I want to search Google for something like 'python' and then open some of the first links, but I have a problem with the select function. What should I put inside it to find the links to the subpages, like: Polish Python Coders Group - News, Welcome to Python.org, ...?
I tried a[href], a, and h3 class, but it doesn't work...
The wrong selector is used in your code. Even if it worked, you wouldn't get what you wanted, because you're selecting all the links on the page, not just the ones that lead to websites.
To get those links, you need the selector that contains them. In this case, it's the .yuRUbf a selector. Let's use the select() method, which returns a list of all matching elements.
To iterate over all links, loop over the list of matched elements that select() returned, and use get('href') or ['href'] to extract the attribute:
for url in soup.select(".yuRUbf a"):
    print(url.get("href"))
Also, make sure you're sending a user-agent in the request headers to act as a "real" user visit, because the default requests user-agent is python-requests, and websites understand that such a request is most likely sent by a script. Check what your user-agent is.
Code and full example in online IDE:
from bs4 import BeautifulSoup
import requests, lxml

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "python",
    "hl": "en",  # language
    "gl": "us"   # country of the search, US -> USA
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
}

html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")

for url in soup.select(".yuRUbf a"):
    print(url.get("href"))
Output:
https://www.python.org/
https://en.wikipedia.org/wiki/Python_(programming_language)
https://www.w3schools.com/python/
https://www.w3schools.com/python/python_intro.asp
https://www.codecademy.com/catalog/language/python
https://www.geeksforgeeks.org/python-programming-language/
If you don't want to figure out how to build a reliable parser from scratch and maintain it, have a look at API solutions. For example Google Organic Results API from SerpApi.
Hello World example:
from serpapi import GoogleSearch
import os

params = {
    # https://docs.python.org/3/library/os.html#os.getenv
    "api_key": os.getenv("API_KEY"),  # your serpapi api key
    "engine": "google",  # search engine
    "q": "python"  # search query
    # other parameters
}

search = GoogleSearch(params)  # where data extraction happens on the SerpApi backend
result_dict = search.get_dict()  # JSON -> Python dict

for result in result_dict["organic_results"]:
    print(result["link"])
Output:
https://www.python.org/
https://en.wikipedia.org/wiki/Python_(programming_language)
https://www.w3schools.com/python/
https://www.codecademy.com/catalog/language/python
https://www.geeksforgeeks.org/python-programming-language/
Is this what you need?
from bs4 import BeautifulSoup
import requests, urllib.parse
import lxml

def print_extracted_data_from_url(url):
    headers = {
        "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
    }
    response = requests.get(url, headers=headers).text
    soup = BeautifulSoup(response, 'lxml')

    for container in soup.findAll('div', class_='tF2Cxc'):
        head_link = container.a['href']
        print(head_link)

    return soup.select_one('a#pnnext')

next_page_node = print_extracted_data_from_url('https://www.google.com/search?hl=en-US&q=python')
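Since the function returns the a#pnnext node (Google's "Next" link), you can keep following pagination; a minimal sketch, assuming Google keeps a relative href on that link:
# Follow the "Next" link until no further result pages exist.
while next_page_node is not None:
    next_page_url = 'https://www.google.com' + next_page_node['href']
    next_page_node = print_extracted_data_from_url(next_page_url)
Note that requesting many pages in a row may still get blocked, so consider adding delays between requests.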

BeautifulSoup4 .get('href') returns not only the href, but some junk as well

I am writing a program which searches for "jopa olega" in Google and prints the URL of the first result.
This is the code I am running:
import requests, webbrowser, bs4
res = requests.get("https://www.google.com/search?q=" + "jopa olega")
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, features="html.parser")
links = soup.select('div#main > div > div > div > a')
href = links[0].get('href') # <---- problem may be here
print(href)
What I expect to see:
https://pirozhki-ru.livejournal.com/990964.html
The actual output:
/url?q=https://pirozhki-ru.livejournal.com/990964.html&sa=U&ved=2ahUKEwjppYzLgKTlAhUMxosKHS5rDmkQFjAAegQIBBAB&usg=AOvVaw0UtLIaLS93pUQMWBngtgz7
This is the html of the link:
<a href="https://pirozhki-ru.livejournal.com/990964.html"
ping="/url?sa=t&source=web&rct=j&url=https://pirozhki-ru.livejournal.com/990964.html&ved=2ahUKEwiHn7P9h6TlAhURpIsKHRX5CRwQFjAAegQIAhAB">...
</a>
By the way, output is different each time. Does anyone know why that happens? Any help is appreciated. Thank you.
If you want to return only one element, use select_one() instead, and then grab the ['href'] attribute:
soup.select_one('.yuRUbf a')['href']  # returns one element rather than a list()
You can access attributes with square brackets instead of using get():
links[0].get('href')
links[0]['href']
soup.select_one('.yuRUbf a')['href']  # prints first link
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.
Make sure you're using a user-agent, otherwise Google will eventually block your request. Check what your user-agent is.
Pass user-agent in request headers:
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get('YOUR_URL', headers=headers)

requests.get("https://www.google.com/search?q=" + "jopa olega")  # no need for the + symbol
requests.get("https://www.google.com/search?q=jopa olega")
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "jopa olega"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

first_link = soup.select_one('.yuRUbf a')['href']
print(first_link)
# https://ar-ar.facebook.com/public/Jopa-Olega
Alternatively, you can achieve the same thing using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to figure out how to scrape the page, since that's already done for the end-user. All that needs to be done is to iterate over structured JSON and get the data you want, without thinking about how to bypass blocks from Google or how to maintain a parser over time.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "jopa olega",
    "hl": "en",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

# [0] - first index of search results
first_link = results['organic_results'][0]['link']
print(first_link)
# https://ar-ar.facebook.com/public/Jopa-Olega
Disclaimer, I work for SerpApi.

Using BeautifulSoup to scrape Google top feedback results for phone number

I'm a beginner at Python. I'm trying to run a script that lets a person input a university name and get a phone number back. The Google featured result is all I need; for example, searching "university of alabama" followed by the word "phone".
But running the code brings me the result "None".
I need help getting down to the phone number in my scrape using Beautiful Soup.
Any suggestions?
The CSS selectors provided in the answers by QHarr and Bitto Bennichan no longer exist in the current Google Organic Results HTML layout, and their code will throw an error (if used without a try/except block).
Currently, it's this:
>>> phone = soup.select_one('.mw31Ze').text
"+1 205-348-6010"
Also, it was returning None because no user-agent was specified, so Google blocked your request and you received different HTML with some sort of error.
The default requests user-agent is python-requests. Google understands this and blocks the request since it's not a "real" user visit. Check what your user-agent is.
Pass the user-agent into the request headers:
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get("YOUR_URL", headers=headers)
Code:
import requests, lxml
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "hl": "en",
    "gl": "uk"  # country to search from. uk = United Kingdom. us = United States
}

query = input("What would you like to search: ")
query = f"https://www.google.com/search?q={query} phone"

response = requests.get(query, headers=headers, params=params)
soup = BeautifulSoup(response.text, 'lxml')

try:
    phone = soup.select_one(".X0KC1c").text
except:
    phone = "not found"

print(phone)
'''
What would you like to search: university of alabama
+1 205-348-6010
'''
Alternatively, you can achieve the same thing by using Google Knowledge Graph API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you only need to iterate over structured JSON and get the data you want, rather than figuring out why certain things break or don't work as they should, and you don't have to maintain the parser over time when changed selectors cause it to break.
Code to integrate:
from serpapi import GoogleSearch

params = {
    "api_key": "YOUR_API_KEY",
    "engine": "google",
    "q": "university of alabama phone",
    "gl": "uk",
    "hl": "en"
}

search = GoogleSearch(params)
results = search.get_dict()

phone = results['knowledge_graph']['phone']
print(phone)
# +1 205-348-6010
Disclaimer, I work for SerpApi.
You are using the find method wrong. You need to give it the name of the tag and then any attribute that identifies that specific tag uniquely. You can use the inspect tool to find the tag in which the phone number is present.
Also, you may need to find your user-agent and pass it as a header in the request to get the exact same response from Google. Just search "what is my user agent" in Google to find yours.
from bs4 import BeautifulSoup
import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36'
}

r = requests.get('https://www.google.com/search?q=university+of+alabama+phone', headers=headers)
soup = BeautifulSoup(r.text, 'html.parser')

ph_no = soup.find('div', class_='Z0LcW').text
print(ph_no)
Output
+1 205-348-6010
Documentation
find() method - BeautifulSoup
No guarantee this holds across all queries, but you can use a CSS class selector to retrieve the first result with select_one, wrapped in a try/except:
import requests
from bs4 import BeautifulSoup

query = input("What would you like to search: ")
query = query.replace(" ", "+")
query = "https://www.google.com/search?q=" + query + "+phone"  # note the "+" separator before "phone"

r = requests.get(query)
html_doc = r.text
soup = BeautifulSoup(html_doc, 'lxml')

try:
    s = soup.select_one(".mrH1y").text
except:
    s = "not found"

print(s)

Python Requests Google Custom Site Search Without API

I'm trying to create a webscraper which will get links from Google search result page. Everything works fine, but I want to search a specific site only, i.e., instead of test, I want to search for site:example.com test. The following is my current code:
import requests,re
from bs4 import BeautifulSoup
from urllib.parse import urlparse, parse_qs
s_term=input("Enter search term: ").replace(" ","+")
r = requests.get('http://www.google.com/search', params={'q':'"'+s_term+'"','num':"50","tbs":"li:1"})
soup = BeautifulSoup(r.content,"html.parser")
links = []
for item in soup.find_all('h3', attrs={'class' : 'r'}):
    links.append(item.a['href'])
print(links)
I tried using: ...params={'q':'"site%3Aexample.com+'+s_term+'"'... but it returns 0 results.
Change your existing params to the below one:
params={"source":"hp","q":"site:example.com test","oq":"site:example.com test","gs_l":"psy-ab.12...10773.10773.0.22438.3.2.0.0.0.0.135.221.1j1.2.0....0...1.2.64.psy-ab..1.1.135.6..35i39k1.zWoG6dpBC3U"}
You only need "q" params. Also, make sure you're using user-agent because Google might block your requests eventually thus you'll receive a completely different HTML. I already answered what is user-agent here.
Pass params:
params = {
    "q": "site:example.com test"
}

requests.get("YOUR_URL", params=params)
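This also explains why your manually encoded attempt ('site%3Aexample.com+...') returned 0 results: requests encodes params itself, so anything pre-encoded gets encoded twice. A quick check with a prepared request (a sketch; nothing is actually sent):
from requests import Request

base = 'https://www.google.com/search'
# Pre-encoded query: the % in %3A and the + are themselves encoded again.
bad = Request('GET', base, params={'q': 'site%3Aexample.com+test'}).prepare().url
# Plain query: requests encodes it once, correctly.
good = Request('GET', base, params={'q': 'site:example.com test'}).prepare().url
print(bad)   # ...?q=site%253Aexample.com%2Btest
print(good)  # ...?q=site%3Aexample.com+test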
Pass user-agent:
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get("YOUR_URL", headers=headers)
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "site:example.com test"
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link)
# http://example.com/
Alternatively, you can do the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to figure out how to make things work, since that's already done for the end-user; the only thing left to do is iterate over structured JSON and get what you want.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "site:example.com test",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['link'])
# http://example.com/
Disclaimer, I work for SerpApi.

Python: parse links from Google with search

I need to parse the links returned by a Google search.
When I view the page source with Ctrl + U, I can't find the element with the links I want. But when I inspect the elements with Ctrl + Shift + I, I can see which element I should parse to get the links.
I use this code:
url = 'https://www.google.ru/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=' + str(query)
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
links = soup.findAll('cite')
But it returns an empty list, because those elements aren't there.
I think the HTML that requests.get(url).content returns isn't complete, so I can't get those elements.
I tried to use google.search, but it returned an error saying it is no longer supported.
Is there any way to get the links from a Google search?
Try:
url = 'https://www.google.ru/search?q=' + str(query)
html = requests.get(url)
soup = BeautifulSoup(html.text, 'lxml')
links = soup.findAll('cite')
print([link.text for link in links])
For installing lxml, please see http://lxml.de/installation.html
*note: The reason I chose lxml instead of html.parser is that I sometimes got incomplete results with html.parser, and I don't know why.
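The parser is just the second argument to BeautifulSoup, so swapping is a one-line change (a minimal sketch; lxml has to be installed separately):
from bs4 import BeautifulSoup

html = '<html><body><cite>example.com</cite></body></html>'
# Same BeautifulSoup API on top of different parsers; lxml is
# generally faster and more forgiving of malformed markup.
print(BeautifulSoup(html, 'html.parser').cite.text)  # example.com
print(BeautifulSoup(html, 'lxml').cite.text)         # example.com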
Use:
url = 'https://www.google.ru/search?q=name&rct=' + str(query)
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
links = soup.findAll('cite')
In order to get the actual response that you see in the browser, you need to send additional headers, most importantly a user-agent (aside from additional query parameters). It's needed to act as a "real" user visit: the client sends a browser-like user-agent string to announce itself as a different client.
That's why you were getting an empty output: you received different HTML, with different elements (CSS selectors, IDs, and so on).
You can read more about it in the blog post I wrote about how to reduce the chance of being blocked while web scraping.
Pass user-agent:
headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

requests.get('URL', headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}

params = {
    'q': 'minecraft',  # query
    'gl': 'us',        # country to search from
    'hl': 'en',        # language
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    print(link, sep='\n')
---------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''
Alternatively, you can achieve the same thing by using Google Organic API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to create it from scratch and maintain it over time if something crashes.
Code to integrate:
import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "minecraft",
    "hl": "en",
    "gl": "us",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(result['link'])
-------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''
Disclaimer, I work for SerpApi.
