Beautiful Soup CSS selector not finding anything - python

I'm using Python 3. The code below is supposed to let the user enter a search term into the command line, after which it searches Google and runs through the HTML of the results page to find tags matching the CSS selector ('.r a').
Say we search for the term "cats." I know the tags I'm looking for exist on the "cats" search results page since I looked through the page source myself.
But when I run my code, the linkElems list is empty. What is going wrong?
import requests, sys, bs4
print('Googling...')
res = requests.get('http://google.com/search?q=' +' '.join(sys.argv[1:]))
print(res.raise_for_status())
soup = bs4.BeautifulSoup(res.text, 'html5lib')
linkElems = soup.select(".r a")
print(linkElems)

The ".r" class is rendered by Javascript, so it's not available in the HTML received. You can either render the javascript using selenium or similar or you can try a more creative solution to extracting the links from the tags. First check that the tags exist by finding them without the ".r" class. soup.find_all("a") Then as an example you can use regex to extract all urls beginning with "/url?q="
import re
linkelems = soup.find_all(href=re.compile("^/url\?q=.*"))

The parts you want to extract are not rendered by JavaScript as Matts mentioned and you don't need regex for such a task.
Make sure you're using user-agent otherwise Google will block your request eventually. That might be the reason why you were getting an empty output since you received a completely different HTML. Check what is your user-agent. I already answered about what is user-agent and HTTP headers.
Pass user-agent into HTTP headers:
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get("YOUR_URL", headers=headers)
html5lib is the slowest parser, try to use lxml instead, it's way faster. If you want to use even faster parser, have a look at selectolax.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "selena gomez"
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
link = result.select_one('.yuRUbf a')['href']
print(link)
----
'''
https://www.instagram.com/selenagomez/
https://www.selenagomez.com/
https://en.wikipedia.org/wiki/Selena_Gomez
https://www.imdb.com/name/nm1411125/
https://www.facebook.com/Selena/
https://www.youtube.com/channel/UCPNxhDvTcytIdvwXWAm43cA
https://www.vogue.com/article/selena-gomez-cover-april-2021
https://open.spotify.com/artist/0C8ZW7ezQVs4URX5aX7Kqx
'''
Alternatively, you can achieve the same thing using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with the parsing part, instead, you only need to iterate over structured JSON and get the data you want, plus you don't have to maintain the parser over time.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "selena gomez",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
link = result['link']
print(link)
----
'''
https://www.instagram.com/selenagomez/
https://www.selenagomez.com/
https://en.wikipedia.org/wiki/Selena_Gomez
https://www.imdb.com/name/nm1411125/
https://www.facebook.com/Selena/
https://www.youtube.com/channel/UCPNxhDvTcytIdvwXWAm43cA
https://www.vogue.com/article/selena-gomez-cover-april-2021
https://open.spotify.com/artist/0C8ZW7ezQVs4URX5aX7Kqx
'''
P.S - I wrote a blog post about how to scrape Google Organic Search Results.
Disclaimer, I work for SerpApi.

Related

Problem with webscraping google python beautiful soup

i am writing code:
i want to open some subpages which have been found.
import bs4
import requests
url = 'https://www.google.com/search?q=python'
res = requests.get(url)
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, 'html.parser')
list_sites = soup.select('a[href]')
print(len(list_sites))
i want to open for example site in google like 'python' and then open some first links, but i have a problem with function select. What i should put inside to find links to
subpage? like a: Polish Python Coders Group - News, Welcome to Python.org, ...
I tried to put: a[href], a, h3 class but it doesnt work...
The wrong selector is selected in your code. Even if it worked, you wouldn't get what you wanted. Because you're selecting all the links on the page, not the ones that lead to websites.
To get these links, you need to get the selector that contains them. In our case, this is the .yuRUbf a selector. Let's use a select() method that will return a list of all the links we need.
To iterate over all links, we can use for loop and iterate the list of matched elements what select() method returned. Use get('href') or ['href'] to extract attributes.
for url in soup.select(".yuRUbf a"):
print(url.get("href"))
Also, make sure you're using request headers user-agent to act as a "real" user visit. Because default requests user-agent is python-requests and websites understand that it's most likely a script that sends a request. Check what's your user-agent.
Code and full example in online IDE:
from bs4 import BeautifulSoup
import requests, lxml
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
"q": "python",
"hl": "en", # language
"gl": "us" # country of the search, US -> USA
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.88 Safari/537.36",
}
html = requests.get("https://www.google.com/search", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
for url in soup.select(".yuRUbf a"):
print(url.get("href"))
Output:
https://www.python.org/
https://en.wikipedia.org/wiki/Python_(programming_language)
https://www.w3schools.com/python/
https://www.w3schools.com/python/python_intro.asp
https://www.codecademy.com/catalog/language/python
https://www.geeksforgeeks.org/python-programming-language/
If you don't want to figure out how to build a reliable parser from scratch and maintain it, have a look at API solutions. For example Google Organic Results API from SerpApi.
Hello World example:
from serpapi import GoogleSearch
import os
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
# https://docs.python.org/3/library/os.html#os.getenv
"api_key": os.getenv("API_KEY"), # your serpapi api key
"engine": "google", # search engine
"q": "python" # search query
# other parameters
}
search = GoogleSearch(params) # where data extraction happens on the SerpApi backend
result_dict = search.get_dict() # JSON -> Python dict
for result in result_dict["organic_results"]:
print(result["link"])
Output:
https://www.python.org/
https://en.wikipedia.org/wiki/Python_(programming_language)
https://www.w3schools.com/python/
https://www.codecademy.com/catalog/language/python
https://www.geeksforgeeks.org/python-programming-language/
is this you need?
from bs4 import BeautifulSoup
import requests, urllib.parse
import lxml
def print_extracted_data_from_url(url):
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get(url, headers=headers).text
soup = BeautifulSoup(response, 'lxml')
for container in soup.findAll('div', class_='tF2Cxc'):
head_link = container.a['href']
print(head_link)
return soup.select_one('a#pnnext')
next_page_node = print_extracted_data_from_url('https://www.google.com/search?hl=en-US&q=python')

No results in scraping bing search

i use code below to scrape results from bing and when I see the scraped web page it says "There are no results for python".
but when I search in the browser there is no problem.
import requests
from bs4 import BeautifulSoup
term = 'python'
url = f'https://www.bing.com/search?q={term}&setlang=en-us'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
I searched and I didn't find any similar problem
You need to pass the user-agent while requesting to get the value.
import requests
from bs4 import BeautifulSoup
term = 'python'
url = 'https://www.bing.com/search?q={}&setlang=en-us'.format(term)
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
Since Bing is a dynamic website, meaning Javascript generates the code, you won't be able to scrape it using only Beautifulsoup. Instead, I recommend selenium, which opens a browser that you can control and parse the code with Beautifulsoup.
The same will happen for any other dynamically coded website, including Google and many others.
It's probably because there's no user-agent being passed into request headers (as already mentioned by KunduK) thus when no user-agent is specified while using requests library, it defaults to python-requests so Bing or other search engines understands that it's a bot/script, then it blocks a request. Check what's your user-agent.
Pass user-agent:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
How to reduce the chance of being blocked while web scraping search engines.
Alternatively, you can achieve the same thing by using Bing Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to spend time trying to bypass blocks from Bing or other search engines. Instead, focus on the data that needs to be extracted from the structured JSON. Check out the playground.
Disclaimer, I work for SerpApi.

Empty list while scraping Google Search Result

I'm trying to scrape Google Search Result but all I'm getting as an output is empty list. Do you have any idea what's wrong here? I found the similar post on Stack Overflow where solution says you should try putting user_agent. I tried but it still returns nothing. Please share if you have any idea.
import requests, webbrowser
from bs4 import BeautifulSoup
user_input = input("Enter something to search:")
print("googling.....")
google_search = requests.get("https://www.google.com/search?q="+user_input)
# print(google_search.text)
soup = BeautifulSoup(google_search.text , 'html.parser')
# print(soup.prettify())
search_results = soup.select('.r a')
# print(search_results)
for link in search_results[:5]:
actual_link = link.get('href')
print(actual_link)
webbrowser.open('https://google.com/'+actual_link)
Google blocks your requests and threw this error This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the Terms of Service. The block will expire shortly after those requests stop. In the meantime, solving the above CAPTCHA will let you continue to use our services.This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests. If you share your network connection, ask your administrator for help — a different computer using the same IP address may be responsible. Learn moreSometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly..
Try using selenium + python to get all the links
To get results from Google page, you have to specify User-Agent http header. For english results, add hl=en parameter to search URL:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
user_input = input("Enter something to search: ")
print("googling.....")
google_search = requests.get("https://www.google.com/search?hl=en&q="+user_input, headers=headers) # <-- add headers and hl=en parameter
soup = BeautifulSoup(google_search.text , 'html.parser')
search_results = soup.select('.r a')
for link in search_results:
actual_link = link.get('href')
print(actual_link)
Prints:
Enter something to search: tree
googling.....
https://en.wikipedia.org/wiki/Tree
#
https://webcache.googleusercontent.com/search?q=cache:wHCoEH9G9w8J:https://en.wikipedia.org/wiki/Tree+&cd=22&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://en.wikipedia.org/wiki/Tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAVegQIAxAH
https://simple.wikipedia.org/wiki/Tree
#
https://webcache.googleusercontent.com/search?q=cache:tNzOpY417g8J:https://simple.wikipedia.org/wiki/Tree+&cd=23&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://simple.wikipedia.org/wiki/Tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAWegQIARAH
https://www.britannica.com/plant/tree
#
https://webcache.googleusercontent.com/search?q=cache:91hg5d2649QJ:https://www.britannica.com/plant/tree+&cd=24&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://www.britannica.com/plant/tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAXegQIAhAJ
https://www.knowablemagazine.org/article/living-world/2018/what-makes-tree-tree
#
https://webcache.googleusercontent.com/search?q=cache:AVSszZLtPiQJ:https://www.knowablemagazine.org/article/living-world/2018/what-makes-tree-tree+&cd=25&hl=en&ct=clnk&gl=sk
https://teamtrees.org/
#
https://webcache.googleusercontent.com/search?q=cache:gVbpYoK7meUJ:https://teamtrees.org/+&cd=26&hl=en&ct=clnk&gl=sk
https://www.ldoceonline.com/dictionary/tree
#
https://webcache.googleusercontent.com/search?q=cache:oyS4e3WdMX8J:https://www.ldoceonline.com/dictionary/tree+&cd=27&hl=en&ct=clnk&gl=sk
https://en.wiktionary.org/wiki/tree
#
https://webcache.googleusercontent.com/search?q=cache:s_tZIjpvHZIJ:https://en.wiktionary.org/wiki/tree+&cd=28&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://en.wiktionary.org/wiki/tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAbegQICBAH
https://www.dictionary.com/browse/tree
#
https://webcache.googleusercontent.com/search?q=cache:EhFIP6m4MuIJ:https://www.dictionary.com/browse/tree+&cd=29&hl=en&ct=clnk&gl=sk
https://www.treepeople.org/tree-benefits
#
https://webcache.googleusercontent.com/search?q=cache:4wLYFp4zTuUJ:https://www.treepeople.org/tree-benefits+&cd=30&hl=en&ct=clnk&gl=sk
EDIT: To filter results you can use this:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
user_input = input("Enter something to search: ")
print("googling.....")
google_search = requests.get("https://www.google.com/search?hl=en&q="+user_input, headers=headers) # <-- add headers and hl=en parameter
soup = BeautifulSoup(google_search.text , 'html.parser')
search_results = soup.select('.r a')
for link in search_results:
actual_link = link.get('href')
if actual_link.startswith('#') or \
actual_link.startswith('https://webcache.googleusercontent.com') or \
actual_link.startswith('/search?'):
continue
print(actual_link)
Prints (for example):
Enter something to search: tree
googling.....
https://en.wikipedia.org/wiki/Tree
https://simple.wikipedia.org/wiki/Tree
https://www.britannica.com/plant/tree
https://www.knowablemagazine.org/article/living-world/2018/what-makes-tree-tree
https://teamtrees.org/
https://www.ldoceonline.com/dictionary/tree
https://en.wiktionary.org/wiki/tree
https://www.dictionary.com/browse/tree
https://www.treepeople.org/tree-benefits
Most websites nowadays use JavaScript to dynamically load their webpages. Google is one of those websites. In order for the full DOM (document object model) to load in, you need a Javascript engine, which beautifulsoup and requests don't have. Arun recommended selenium, and I do to, as it has an embedded Javascript engine.
Here is the Python Selenium documentation:
https://selenium-python.readthedocs.io/
The OP desired output doesn't come from JavaScript as Serket mentioned. All data that OP needed is located in the HTML.
There's no point in selenium as well for the same reason, it's all there, in the HTML, not rendered via JavaScript.
One of the problems as other people mentioned is because of no user-agent specified AND you possibly passed the wrong user-agent which leads to a completely different HTML that contains an error message or something similar. Check out what is your user-agent.
Pass user-agent:
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get(YOUR_URL, headers=headers)
You can also grab attributes by passing them in square brackets:
element.get('href')
# is equivalent to
element['href']
Code and example in the online IDE (CSS selectors reference):
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "fus ro dah" # query
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
# container with links and iterate over it
for result in soup.select('.tF2Cxc'):
link = result.select_one('.yuRUbf a')['href']
-------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.etsy.com/market/fus_ro_dah
https://www.nexusmods.com/skyrimspecialedition/mods/4889/
https://www.textualtees.com/products/fus-ro-dah-t-shirt
'''
Alternatively, you can achieve the same thing by using Google Search Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't need to figure out why or how to deal with such a problem since this part (extraction/scraping) is already done for the end-user. All that needs to be done is just to iterate over structured JSON and get what you want.
Code:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "fus ro day",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
print(result['link'])
---------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.etsy.com/market/fus_ro_dah
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.textualtees.com/products/fus-ro-dah-t-shirt
https://tenor.com/search/fus-ro-dah-gifs
'''
P.S - I have a blog post that covers a bit more in-depth how to scrape Google Organic Search Results.
Disclaimer, I work for SerpApi.

BeautifulSoup4 .get('href') returns not only the href, but some junk as well

I am writing a program which searches "jopa olega" in Google and prints the url of the first result
This is the code I am running:
import requests, webbrowser, bs4
res = requests.get("https://www.google.com/search?q=" + "jopa olega")
res.raise_for_status()
soup = bs4.BeautifulSoup(res.text, features="html.parser")
links = soup.select('div#main > div > div > div > a')
href = links[0].get('href') # <---- problem may be here
print(href)
What I expect to see:
https://pirozhki-ru.livejournal.com/990964.html
The actual output:
/url?q=https://pirozhki-ru.livejournal.com/990964.html&sa=U&ved=2ahUKEwjppYzLgKTlAhUMxosKHS5rDmkQFjAAegQIBBAB&usg=AOvVaw0UtLIaLS93pUQMWBngtgz7
This is the html of the link:
<a href="https://pirozhki-ru.livejournal.com/990964.html"
ping="/url?sa=t&source=web&rct=j&url=https://pirozhki-ru.livejournal.com/990964.html&ved=2ahUKEwiHn7P9h6TlAhURpIsKHRX5CRwQFjAAegQIAhAB">...
</a>
By the way, output is different each time. Does anyone know why that happens? Any help is appreciated. Thank you.
If you want to return only one element, use select_one() instead and then call for ['href'] attribute:
soup.select_one('.yuRUbf a')['href'] # return one element rather than a list()
You can access attributes in the square brackets instead of using get():
links[0].get('href')
links[0]['href']
soup.select_one('.yuRUbf a')['href'] # prints first link
Have a look at the SelectorGadged Chrome extension to grab CSS selectors by clicking on the desired element in your browser. CSS selectors reference.
Make sure you're using user-agent, otherwise Google will block your request eventually. Check what's your user-agent.
Pass user-agent in request headers:
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get('YOUR_URL', headers=headers)
requests.get("https://www.google.com/search?q=" + "jopa olega") # no need for + symbol
requests.get("https://www.google.com/search?q=jopa olega")
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "jopa olega"
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
first_link = soup.select_one('.yuRUbf a')['href']
print(first_link)
# https://ar-ar.facebook.com/public/Jopa-Olega
Alternatively, you can achieve the same thing using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to figure out how to scrape stuff since it's already done for the end-user. All that needs to be done is just to iterate over structured JSON and get the data you want without thinking to bypass blocks from Google or maintain a parser over time.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "jopa olega",
"hl": "en",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
# [0] - first index of search results
first_link = results['organic_results'][0]['link']
print(first_link)
# https://ar-ar.facebook.com/public/Jopa-Olega
Disclaimer, I work for SerpApi.

Python: parse links from Google with search

I need to parse links with results after search in Google.
When I try to see code of page and Ctrl + U I can't find element with links, what I want.
But When I see code of elements with
Ctrl + Shift + I I can see what elem should I parse to get links.
I use code
url = 'https://www.google.ru/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=' + str(query)
html = requests.get(url).content
soup = BeautifulSoup(html, 'html.parser')
links = soup.findAll('cite')
But it returns empty list, becauses there are not this elements.
I think that html-code, that returns requests.get(url).content isn't full, so I can't get this elements.
I tried to use google.search but it returned error that it isn't used now.
Is any way to get links with search in google?
Try:
url = 'https://www.google.ru/search?q=' + str(query)
html = requests.get(url)
soup = BeautifulSoup(html.text, 'lxml')
links = soup.findAll('cite')
print([link.text for link in links])
For installing lxml, please see http://lxml.de/installation.html
*note: The reason I choose lxml instead html.parser is that sometimes I got incomplete result with html.parser and I don't know why
USe:
url = 'https://www.google.ru/search?q=name&rct=' + str(query)
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
links = soup.findAll('cite')
In order to get the actual response that you see in the browser, you need to send additional headers, more specifically user-agent (aside from sending additional query parameters) which is needed to act as a "real" user visit when the bot or browser sends a fake user-agent string to announce themselves as a different client.
That's why you were getting an empty output because you received a different HTML with different elements (CSS selectors, ID's, and so on).
You can read more about it in the blog post I wrote about how to reduce the chance of being blocked while web scraping.
Pass user-agent:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
Code and example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
params = {
'q': 'minecraft', # query
'gl': 'us', # country to search from
'hl': 'en', # language
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
link = result.select_one('.yuRUbf a')['href']
print(link, sep='\n')
---------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''
Alternatively, you can achieve the same thing by using Google Organic API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to create it from scratch and maintain it over time if something crashes.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "minecraft",
"hl": "en",
"gl": "us",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
print(result['link'])
-------
'''
https://www.minecraft.net/en-us/
https://classic.minecraft.net/
https://play.google.com/store/apps/details?id=com.mojang.minecraftpe&hl=en_US&gl=US
https://en.wikipedia.org/wiki/Minecraft
'''
Disclaimer, I work for SerpApi.

Categories