Empty list while scraping Google Search Result - python

I'm trying to scrape Google Search Result but all I'm getting as an output is empty list. Do you have any idea what's wrong here? I found the similar post on Stack Overflow where solution says you should try putting user_agent. I tried but it still returns nothing. Please share if you have any idea.
import requests, webbrowser
from bs4 import BeautifulSoup
user_input = input("Enter something to search:")
print("googling.....")
google_search = requests.get("https://www.google.com/search?q="+user_input)
# print(google_search.text)
soup = BeautifulSoup(google_search.text , 'html.parser')
# print(soup.prettify())
search_results = soup.select('.r a')
# print(search_results)
for link in search_results[:5]:
actual_link = link.get('href')
print(actual_link)
webbrowser.open('https://google.com/'+actual_link)

Google blocks your requests and threw this error This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the Terms of Service. The block will expire shortly after those requests stop. In the meantime, solving the above CAPTCHA will let you continue to use our services.This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests. If you share your network connection, ask your administrator for help — a different computer using the same IP address may be responsible. Learn moreSometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly..
Try using selenium + python to get all the links

To get results from Google page, you have to specify User-Agent http header. For english results, add hl=en parameter to search URL:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
user_input = input("Enter something to search: ")
print("googling.....")
google_search = requests.get("https://www.google.com/search?hl=en&q="+user_input, headers=headers) # <-- add headers and hl=en parameter
soup = BeautifulSoup(google_search.text , 'html.parser')
search_results = soup.select('.r a')
for link in search_results:
actual_link = link.get('href')
print(actual_link)
Prints:
Enter something to search: tree
googling.....
https://en.wikipedia.org/wiki/Tree
#
https://webcache.googleusercontent.com/search?q=cache:wHCoEH9G9w8J:https://en.wikipedia.org/wiki/Tree+&cd=22&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://en.wikipedia.org/wiki/Tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAVegQIAxAH
https://simple.wikipedia.org/wiki/Tree
#
https://webcache.googleusercontent.com/search?q=cache:tNzOpY417g8J:https://simple.wikipedia.org/wiki/Tree+&cd=23&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://simple.wikipedia.org/wiki/Tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAWegQIARAH
https://www.britannica.com/plant/tree
#
https://webcache.googleusercontent.com/search?q=cache:91hg5d2649QJ:https://www.britannica.com/plant/tree+&cd=24&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://www.britannica.com/plant/tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAXegQIAhAJ
https://www.knowablemagazine.org/article/living-world/2018/what-makes-tree-tree
#
https://webcache.googleusercontent.com/search?q=cache:AVSszZLtPiQJ:https://www.knowablemagazine.org/article/living-world/2018/what-makes-tree-tree+&cd=25&hl=en&ct=clnk&gl=sk
https://teamtrees.org/
#
https://webcache.googleusercontent.com/search?q=cache:gVbpYoK7meUJ:https://teamtrees.org/+&cd=26&hl=en&ct=clnk&gl=sk
https://www.ldoceonline.com/dictionary/tree
#
https://webcache.googleusercontent.com/search?q=cache:oyS4e3WdMX8J:https://www.ldoceonline.com/dictionary/tree+&cd=27&hl=en&ct=clnk&gl=sk
https://en.wiktionary.org/wiki/tree
#
https://webcache.googleusercontent.com/search?q=cache:s_tZIjpvHZIJ:https://en.wiktionary.org/wiki/tree+&cd=28&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://en.wiktionary.org/wiki/tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAbegQICBAH
https://www.dictionary.com/browse/tree
#
https://webcache.googleusercontent.com/search?q=cache:EhFIP6m4MuIJ:https://www.dictionary.com/browse/tree+&cd=29&hl=en&ct=clnk&gl=sk
https://www.treepeople.org/tree-benefits
#
https://webcache.googleusercontent.com/search?q=cache:4wLYFp4zTuUJ:https://www.treepeople.org/tree-benefits+&cd=30&hl=en&ct=clnk&gl=sk
EDIT: To filter results you can use this:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
user_input = input("Enter something to search: ")
print("googling.....")
google_search = requests.get("https://www.google.com/search?hl=en&q="+user_input, headers=headers) # <-- add headers and hl=en parameter
soup = BeautifulSoup(google_search.text , 'html.parser')
search_results = soup.select('.r a')
for link in search_results:
actual_link = link.get('href')
if actual_link.startswith('#') or \
actual_link.startswith('https://webcache.googleusercontent.com') or \
actual_link.startswith('/search?'):
continue
print(actual_link)
Prints (for example):
Enter something to search: tree
googling.....
https://en.wikipedia.org/wiki/Tree
https://simple.wikipedia.org/wiki/Tree
https://www.britannica.com/plant/tree
https://www.knowablemagazine.org/article/living-world/2018/what-makes-tree-tree
https://teamtrees.org/
https://www.ldoceonline.com/dictionary/tree
https://en.wiktionary.org/wiki/tree
https://www.dictionary.com/browse/tree
https://www.treepeople.org/tree-benefits

Most websites nowadays use JavaScript to dynamically load their webpages. Google is one of those websites. In order for the full DOM (document object model) to load in, you need a Javascript engine, which beautifulsoup and requests don't have. Arun recommended selenium, and I do to, as it has an embedded Javascript engine.
Here is the Python Selenium documentation:
https://selenium-python.readthedocs.io/

The OP desired output doesn't come from JavaScript as Serket mentioned. All data that OP needed is located in the HTML.
There's no point in selenium as well for the same reason, it's all there, in the HTML, not rendered via JavaScript.
One of the problems as other people mentioned is because of no user-agent specified AND you possibly passed the wrong user-agent which leads to a completely different HTML that contains an error message or something similar. Check out what is your user-agent.
Pass user-agent:
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get(YOUR_URL, headers=headers)
You can also grab attributes by passing them in square brackets:
element.get('href')
# is equivalent to
element['href']
Code and example in the online IDE (CSS selectors reference):
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "fus ro dah" # query
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
# container with links and iterate over it
for result in soup.select('.tF2Cxc'):
link = result.select_one('.yuRUbf a')['href']
-------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.etsy.com/market/fus_ro_dah
https://www.nexusmods.com/skyrimspecialedition/mods/4889/
https://www.textualtees.com/products/fus-ro-dah-t-shirt
'''
Alternatively, you can achieve the same thing by using Google Search Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't need to figure out why or how to deal with such a problem since this part (extraction/scraping) is already done for the end-user. All that needs to be done is just to iterate over structured JSON and get what you want.
Code:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "fus ro day",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
print(result['link'])
---------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.etsy.com/market/fus_ro_dah
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.textualtees.com/products/fus-ro-dah-t-shirt
https://tenor.com/search/fus-ro-dah-gifs
'''
P.S - I have a blog post that covers a bit more in-depth how to scrape Google Organic Search Results.
Disclaimer, I work for SerpApi.

Related

Problem with webscraping google python beautiful soup

i am writing code:
i want to open some subpages which have been found.
import bs4
import requests
url = 'https://www.google.com/search?q=python'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
list_sites = soup.select('a[href]')
print(len(list_sites))
i want to open for example site in google like 'python' and then open some first links, but i have a problem with function select. What i should put inside to find links to
subpage? like a: Polish Python Coders Group - News, Welcome to Python.org, ...
I tried to put: a[href], a, h3 class but it doesnt work...
The wrong selector is selected in your code. Even if it worked, you wouldn't get what you wanted. Because you're selecting all the links on the page, not the ones that lead to websites.
To get these links, you need to get the selector that contains them. In our case, this is the .yuRUbf a selector. Let's use a select() method that will return a list of all the links we need.
To iterate over all links, we can use for loop and iterate the list of matched elements what select() method returned. Use get('href') or ['href'] to extract attributes.
for url in soup.select(".yuRUbf a"):
print(url.get("href"))
Also, make sure you're using request headers user-agent to act as a "real" user visit. Because default requests user-agent is python-requests and websites understand that it's most likely a script that sends a request. Check what's your user-agent.
Code and full example in online IDE:
from bs4 import BeautifulSoup
import requests, lxml
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
"q": "python",
"hl": "en", # language
"gl": "us" # country of the search, US -> USA
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
}
html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
for url in soup.select(".yuRUbf a"):
print(url.get("href"))
Output:
https://www.python.org/
https://en.wikipedia.org/wiki/Python_(programming_language)
https://www.w3schools.com/python/
https://www.w3schools.com/python/python_intro.asp
https://www.codecademy.com/catalog/language/python
https://www.geeksforgeeks.org/python-programming-language/
If you don't want to figure out how to build a reliable parser from scratch and maintain it, have a look at API solutions. For example Google Organic Results API from SerpApi.
Hello World example:
from serpapi import GoogleSearch
import os
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
# https://docs.python.org/3/library/os.html#os.getenv
"api_key": os.getenv("API_KEY"), # your serpapi api key
"engine": "google", # search engine
"q": "python" # search query
# other parameters
}
search = GoogleSearch(params) # where data extraction happens on the SerpApi backend
result_dict = search.get_dict() # JSON -> Python dict
for result in result_dict["organic_results"]:
print(result["link"])
Output:
https://www.python.org/
https://en.wikipedia.org/wiki/Python_(programming_language)
https://www.w3schools.com/python/
https://www.codecademy.com/catalog/language/python
https://www.geeksforgeeks.org/python-programming-language/
is this you need?
from bs4 import BeautifulSoup
import requests, urllib.parse
import lxml
def print_extracted_data_from_url(url):
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get(url, headers=headers).text
soup = BeautifulSoup(response, 'lxml')
for container in soup.findAll('div', class_='tF2Cxc'):
head_link = container.a['href']
print(head_link)
return soup.select_one('a#pnnext')
next_page_node = print_extracted_data_from_url('https://www.google.com/search?hl=en-US&q=python')

No results in scraping bing search

i use code below to scrape results from bing and when I see the scraped web page it says "There are no results for python".
but when I search in the browser there is no problem.
import requests
from bs4 import BeautifulSoup
term = 'python'
url = f'https://www.bing.com/search?q={term}&setlang=en-us'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
I searched and I didn't find any similar problem
You need to pass the user-agent while requesting to get the value.
import requests
from bs4 import BeautifulSoup
term = 'python'
url = 'https://www.bing.com/search?q={}&setlang=en-us'.format(term)
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
Since Bing is a dynamic website, meaning Javascript generates the code, you won't be able to scrape it using only Beautifulsoup. Instead, I recommend selenium, which opens a browser that you can control and parse the code with Beautifulsoup.
The same will happen for any other dynamically coded website, including Google and many others.
It's probably because there's no user-agent being passed into request headers (as already mentioned by KunduK) thus when no user-agent is specified while using requests library, it defaults to python-requests so Bing or other search engines understands that it's a bot/script, then it blocks a request. Check what's your user-agent.
Pass user-agent:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
How to reduce the chance of being blocked while web scraping search engines.
Alternatively, you can achieve the same thing by using Bing Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to spend time trying to bypass blocks from Bing or other search engines. Instead, focus on the data that needs to be extracted from the structured JSON. Check out the playground.
Disclaimer, I work for SerpApi.

Beautiful Soup CSS selector not finding anything

I'm using Python 3. The code below is supposed to let the user enter a search term into the command line, after which it searches Google and runs through the HTML of the results page to find tags matching the CSS selector ('.r a').
Say we search for the term "cats." I know the tags I'm looking for exist on the "cats" search results page since I looked through the page source myself.
But when I run my code, the linkElems list is empty. What is going wrong?
import requests, sys, bs4
print('Googling...')
res = requests.get('http://google.com/search?q=' +' '.join(sys.argv[1:]))
print(res.raise_for_status())
soup = bs4.BeautifulSoup(res.text, 'html5lib')
linkElems = soup.select(".r a")
print(linkElems)
The ".r" class is rendered by Javascript, so it's not available in the HTML received. You can either render the javascript using selenium or similar or you can try a more creative solution to extracting the links from the tags. First check that the tags exist by finding them without the ".r" class. soup.find_all("a") Then as an example you can use regex to extract all urls beginning with "/url?q="
import re
linkelems = soup.find_all(href=re.compile("^/url\?q=.*"))
The parts you want to extract are not rendered by JavaScript as Matts mentioned and you don't need regex for such a task.
Make sure you're using user-agent otherwise Google will block your request eventually. That might be the reason why you were getting an empty output since you received a completely different HTML. Check what is your user-agent. I already answered about what is user-agent and HTTP headers.
Pass user-agent into HTTP headers:
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get("YOUR_URL", headers=headers)
html5lib is the slowest parser, try to use lxml instead, it's way faster. If you want to use even faster parser, have a look at selectolax.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "selena gomez"
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
link = result.select_one('.yuRUbf a')['href']
print(link)
----
'''
https://www.instagram.com/selenagomez/
https://www.selenagomez.com/
https://en.wikipedia.org/wiki/Selena_Gomez
https://www.imdb.com/name/nm1411125/
https://www.facebook.com/Selena/
https://www.youtube.com/channel/UCPNxhDvTcytIdvwXWAm43cA
https://www.vogue.com/article/selena-gomez-cover-april-2021
https://open.spotify.com/artist/0C8ZW7ezQVs4URX5aX7Kqx
'''
Alternatively, you can achieve the same thing using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with the parsing part, instead, you only need to iterate over structured JSON and get the data you want, plus you don't have to maintain the parser over time.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "selena gomez",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
link = result['link']
print(link)
----
'''
https://www.instagram.com/selenagomez/
https://www.selenagomez.com/
https://en.wikipedia.org/wiki/Selena_Gomez
https://www.imdb.com/name/nm1411125/
https://www.facebook.com/Selena/
https://www.youtube.com/channel/UCPNxhDvTcytIdvwXWAm43cA
https://www.vogue.com/article/selena-gomez-cover-april-2021
https://open.spotify.com/artist/0C8ZW7ezQVs4URX5aX7Kqx
'''
P.S - I wrote a blog post about how to scrape Google Organic Search Results.
Disclaimer, I work for SerpApi.

soup.select('.r a') in f'https://google.com/search?q={query}' brings back empty list in Python BeautifulSoup. **NOT A DUPLICATE**

The Situation:
The "I'm Feeling Lucky!" project in the "Automate the boring stuff with Python" ebook no longer works with the code he provided.
Specifically:
linkElems = soup.select('.r a')
What I have done:
I've already tried using the solution provided within this stackoverflow question
I'm also currently using the same search format.
Code:
import webbrowser, requests, bs4
def im_feeling_lucky():
# Make search query look like Google's
search = '+'.join(input('Search Google: ').split(" "))
# Pull html from Google
print('Googling...') # display text while downloading the Google page
res = requests.get(f'https://google.com/search?q={search}&oq={search}')
res.raise_for_status()
# Retrieve top search result link
soup = bs4.BeautifulSoup(res.text, features='lxml')
# Open a browser tab for each result.
linkElems = soup.select('.r') # Returns empty list
numOpen = min(5, len(linkElems))
print('Before for loop')
for i in range(numOpen):
webbrowser.open(f'http://google.com{linkElems[i].get("href")}')
The Problem:
The linkElems variable returns an empty list [] and the program doesn't do anything past that.
The Question:
Could sombody please guide me to he correct way of handling this and perhaps explain why it isn't working?
I too had had the same problem while reading that book and found a solution for that problem.
replacing
soup.select('.r a')
with
soup.select('div#main > div > div > div > a')
will solve that issue
following is the code that will work
import webbrowser, requests, bs4 , sys
print('Googling...')
res = requests.get('https://google.com/search?q=' + ' '.join(sys.argv[1:]))
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text)
linkElems = soup.select('div#main > div > div > div > a')
numOpen = min(5, len(linkElems))
for i in range(numOpen):
webbrowser.open('http://google.com' + linkElems[i].get("href"))
the above code takes input from commandline arguments
I took a different route. I saved the HTML from the request and opened that page, then I inspected the elements. It turns out that the page is different if I open it natively in the Chrome browser compared to what my python request is served. I identified the div with the class that appears to denote a result and supplemented that for the .r - in my case it was .kCrYT
#! python3
# lucky.py - Opens several Google Search results.
import requests, sys, webbrowser, bs4
print('Googling...') # display text while the google page is downloading
url= 'http://www.google.com.au/search?q=' + ' '.join(sys.argv[1:])
url = url.replace(' ','+')
res = requests.get(url)
res.raise_for_status()
# Retrieve top search result links.
soup=bs4.BeautifulSoup(res.text, 'html.parser')
# get all of the 'a' tags afer an element with the class 'kCrYT' (which are the results)
linkElems = soup.select('.kCrYT > a')
# Open a browser tab for each result.
numOpen = min(5, len(linkElems))
for i in range(numOpen):
webbrowser.open_new_tab('http://google.com.au' + linkElems[i].get('href'))
Different websites (for instance Google) generate different HTML codes to different User-Agents (this is how the web browser is identified by the website). Another solution to your problem is to use a browser User-Agent to ensure that the HTML code you obtain from the website is the same you would get by using "view page source" in your browser. The following code just prints the list of google search result urls, not the same as the book you've referenced but it's still useful to show the point.
#! python3
# lucky.py - Opens several Google search results.
import requests, sys, webbrowser, bs4
print('Please enter your search term:')
searchTerm = input()
print('Googling...') # display thext while downloading the Google page
url = 'http://google.com/search?q=' + ' '.join(searchTerm)
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
res = requests.get(url, headers=headers)
res.raise_for_status()
# Retrieve top search results links.
soup = bs4.BeautifulSoup(res.content)
# Open a browser tab for each result.
linkElems = soup.select('.r > a') # Used '.r > a' instead of '.r a' because
numOpen = min(5, len(linkElems)) # there are many href after div class="r"
for i in range(numOpen):
# webbrowser.open('http://google.com' + linkElems[i].get('href'))
print(linkElems[i].get('href'))
There's actually no need to save the HTML file, and one of the reasons why response output is different from the one you see in the browser is that there are no headers being sent with the request, in this case, user-agent which will act as a "real" user visit (already written by Cucurucho).
When no user-agent is specified (when using requests library) it defaults to python-requests thus Google understands it, blocks a request and you receive a different HTML with different CSS selectors. Check what's your user-agent.
Pass user-agent:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
To easier grab CSS selectors, have a look at the SelectorGadget extension to get CSS selectors by clicking on the desired element in your browser.
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
params = {
'q': 'how to create minecraft server',
'gl': 'us',
'hl': 'en',
}
html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')
# [:5] - first 5 results
# container with needed data: title, link, snippet, etc.
for result in soup.select('.tF2Cxc')[:5]:
link = result.select_one('.yuRUbf a')['href']
print(link, sep='\n')
----------
'''
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
https://www.minecraft.net/en-us/download/server
https://www.idtech.com/blog/creating-minecraft-server
https://minecraft.fandom.com/wiki/Tutorials/Setting_up_a_server
https://codewizardshq.com/how-to-make-a-minecraft-server/
'''
Alternatively, you can achieve the same thing by using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to spend time thinking about how to bypass blocks from Google or what is the right CSS selector to parse the data, instead, you need to pass parameters (params) you want, and iterate over structured JSON and get the data you want.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "how to create minecraft server",
"hl": "en",
"gl": "us",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"][:5]:
print(result["link"], sep="\n")
----------
'''
https://help.minecraft.net/hc/en-us/articles/360058525452-How-to-Setup-a-Minecraft-Java-Edition-Server
https://www.minecraft.net/en-us/download/server
https://www.idtech.com/blog/creating-minecraft-server
https://minecraft.fandom.com/wiki/Tutorials/Setting_up_a_server
https://codewizardshq.com/how-to-make-a-minecraft-server/
'''
Disclaimer, I work for SerpApi.

Python: parse links from Google with search

I need to parse links with results after search in Google.
When I try to see code of page and Ctrl + U I can't find element with links, what I want.
But When I see code of elements with
Ctrl + Shift + I I can see what elem should I parse to get links.
I use code
url = 'https://www.google.ru/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=' + str(query)
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
links = soup.findAll('cite')
But it returns empty list, becauses there are not this elements.
I think that html-code, that returns requests.get(url).content isn't full, so I can't get this elements.
I tried to use google.search but it returned error that it isn't used now.
Is any way to get links with search in google?
Try:
url = 'https://www.google.ru/search?q=' + str(query)
html = requests.get(url)
soup = BeautifulSoup(html.text, 'lxml')
links = soup.findAll('cite')
print([link.text for link in links])
For installing lxml, please see http://lxml.de/installation.html
*note: The reason I choose lxml instead html.parser is that sometimes I got incomplete result with html.parser and I don't know why
USe:
url = 'https://www.google.ru/search?q=name&rct=' + str(query)
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
links = soup.findAll('cite')
In order to get the actual response that you see in the browser, you need to send additional headers, more specifically user-agent (aside from sending additional query parameters) which is needed to act as a "real" user visit when the bot or browser sends a fake user-agent string to announce themselves as a different client.
That's why you were getting an empty output because you received a different HTML with different elements (CSS selectors, ID's, and so on).
You can read more about it in the blog post I wrote about how to reduce the chance of being blocked while web scraping.
Pass user-agent:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
params = {
'q': 'minecraft', # query
'gl': 'us', # country to search from
'hl': 'en', # language
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
link = result.select_one('.yuRUbf a')['href']
print(link, sep='\n')
---------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''
Alternatively, you can achieve the same thing by using Google Organic API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to create it from scratch and maintain it over time if something crashes.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "minecraft",
"hl": "en",
"gl": "us",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
print(result['link'])
-------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''
Disclaimer, I work for SerpApi.

Categories