No results in scraping bing search - python

i use code below to scrape results from bing and when I see the scraped web page it says "There are no results for python".
but when I search in the browser there is no problem.
import requests
from bs4 import BeautifulSoup
term = 'python'
url = f'https://www.bing.com/search?q={term}&setlang=en-us'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())
I searched and I didn't find any similar problem

You need to pass the user-agent while requesting to get the value.
import requests
from bs4 import BeautifulSoup
term = 'python'
url = 'https://www.bing.com/search?q={}&setlang=en-us'.format(term)
headers = {'User-Agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36'}
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
print(soup.prettify())

Since Bing is a dynamic website, meaning Javascript generates the code, you won't be able to scrape it using only Beautifulsoup. Instead, I recommend selenium, which opens a browser that you can control and parse the code with Beautifulsoup.
The same will happen for any other dynamically coded website, including Google and many others.

It's probably because there's no user-agent being passed into request headers (as already mentioned by KunduK) thus when no user-agent is specified while using requests library, it defaults to python-requests so Bing or other search engines understands that it's a bot/script, then it blocks a request. Check what's your user-agent.
Pass user-agent:
headers = {
'User-agent':
'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582'
}
requests.get('URL', headers=headers)
How to reduce the chance of being blocked while web scraping search engines.
Alternatively, you can achieve the same thing by using Bing Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to spend time trying to bypass blocks from Bing or other search engines. Instead, focus on the data that needs to be extracted from the structured JSON. Check out the playground.
Disclaimer, I work for SerpApi.

Related

How can I get URLs from Oddsportal?

How can I get all the URLs from this particular link: https://www.oddsportal.com/results/#soccer
For every URL on this page, there are multiple pages e.g. the first link of the page:
https://www.oddsportal.com/soccer/africa/
leads to the below page as an example:
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/
-> https://www.oddsportal.com/soccer/africa/africa-cup-of-nations/results/#/page/2/...
https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2019/results/
-> https://www.oddsportal.com/soccer/africa/africa-cup-of-nations-2019/results/#/page/2/...
I would ideally like to code in python as I am pretty comfortable with it (more than other languages through not at all close to what I can call as comfortable)
and
After clicking on the link:
When I go to inspect element, I can see tha the links can be scraped however I am very new to it.
Please help
I have extracted the URLs from the main page that you mentioned.
import requests
import bs4 as bs
url = 'https://www.oddsportal.com/results/#soccer'
headers = {'User-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36'}
resp = requests.get(url, headers=headers)
soup = bs.BeautifulSoup(resp.text, 'html.parser')
base_url = 'https://www.oddsportal.com'
a = soup.findAll('a', attrs={'foo': 'f'})
# This set will have all the URLs of the main page
s = set()
for i in a:
s.add(base_url + i['href'])
Since you are new to web-scraping I suggest you to go through these.
Beautiful Soup - Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
requests - Requests is an elegant and simple HTTP library for Python.
Docs: https://docs.python-requests.org/en/master/
Selenium - Selenium is an umbrella project for a range of tools and libraries that enable and support the automation of web browsers.
Docs: https://selenium-python.readthedocs.io/

Beautiful Soup \u003b appears and messes up find_all?

I've been working on a web scraper for top news sites. Beautiful Soup in python has been a great tool, letting me get full articles with very simple code. BUT
article_url='https://apnews.com/article/lifestyle-travel-coronavirus-pandemic-health-education-418fe38201db53d2848e0138a28ff824'
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)
user_agent='Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
request_header={ 'User-Agent': user_agent}
source=session.get(article_url, headers=request_header).text
soup = BeautifulSoup(source,'lxml')
#get all <p> paragraphs from article
paragraphs=soup.find_all('p')
#print each paragraph as a line
for paragraph in paragraphs:
print(paragraph)
This works great on most news sites I've tried BUT for some reason The AP site gives me no output at all. Which is strange because the exact same code works on maybe 10 other sites like the NYT, WaPo, and The Hill. And I know why.
What it does is, where every other site prints out all the paragraphs, it prints nothing. Except when I look at the paragraphs soup variable, here is the kind of thing I see:
address the pandemic.\u003c/p>\u003cdiv class=\"ad-placeholder\">\u003c/div>\u003cp>Instead, public schools
Clearly what's happening is the < HTML symbol is being translated as \u003b. And because of that find_all('p') can't properly find the HTML tags. But for some reason only the AP site is doing it. When I inspect the AP website, their html has the same symbols as all the other sites.
Does anyone have any idea why this is happening? Or what I can do to fix it? Because I'm seriously confused
For me, at least, I had to extract a javascript object containing the data with regex, then parse with json into json object, then grab the value associated with the page html as you see it in browser, soup it and then extract the paragraphs. I removed the retries stuff; you can easily re-insert that.
import requests
#from requests.adapters import HTTPAdapter
#from requests.packages.urllib3.util.retry import Retry
from bs4 import BeautifulSoup
import re,json
article_url='https://apnews.com/article/lifestyle-travel-coronavirus-pandemic-health-education-418fe38201db53d2848e0138a28ff824'
user_agent='Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
request_header={ 'User-Agent': user_agent}
source = requests.get(article_url, headers=request_header).text
data = json.loads(re.search(r"window\['titanium-state'\] = (.*)", source, re.M).group(1))
content = data['content']['data']
content = content[list(content.keys())[0]]
soup = BeautifulSoup(content['storyHTML'])
for p in soup.select('p'):
print(p.text.strip())
Regex:

Empty list while scraping Google Search Result

I'm trying to scrape Google Search Result but all I'm getting as an output is empty list. Do you have any idea what's wrong here? I found the similar post on Stack Overflow where solution says you should try putting user_agent. I tried but it still returns nothing. Please share if you have any idea.
import requests, webbrowser
from bs4 import BeautifulSoup
user_input = input("Enter something to search:")
print("googling.....")
google_search = requests.get("https://www.google.com/search?q="+user_input)
# print(google_search.text)
soup = BeautifulSoup(google_search.text , 'html.parser')
# print(soup.prettify())
search_results = soup.select('.r a')
# print(search_results)
for link in search_results[:5]:
actual_link = link.get('href')
print(actual_link)
webbrowser.open('https://google.com/'+actual_link)
Google blocks your requests and threw this error This page appears when Google automatically detects requests coming from your computer network which appear to be in violation of the Terms of Service. The block will expire shortly after those requests stop. In the meantime, solving the above CAPTCHA will let you continue to use our services.This traffic may have been sent by malicious software, a browser plug-in, or a script that sends automated requests. If you share your network connection, ask your administrator for help — a different computer using the same IP address may be responsible. Learn moreSometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly..
Try using selenium + python to get all the links
To get results from Google page, you have to specify User-Agent http header. For english results, add hl=en parameter to search URL:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
user_input = input("Enter something to search: ")
print("googling.....")
google_search = requests.get("https://www.google.com/search?hl=en&q="+user_input, headers=headers) # <-- add headers and hl=en parameter
soup = BeautifulSoup(google_search.text , 'html.parser')
search_results = soup.select('.r a')
for link in search_results:
actual_link = link.get('href')
print(actual_link)
Prints:
Enter something to search: tree
googling.....
https://en.wikipedia.org/wiki/Tree
#
https://webcache.googleusercontent.com/search?q=cache:wHCoEH9G9w8J:https://en.wikipedia.org/wiki/Tree+&cd=22&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://en.wikipedia.org/wiki/Tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAVegQIAxAH
https://simple.wikipedia.org/wiki/Tree
#
https://webcache.googleusercontent.com/search?q=cache:tNzOpY417g8J:https://simple.wikipedia.org/wiki/Tree+&cd=23&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://simple.wikipedia.org/wiki/Tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAWegQIARAH
https://www.britannica.com/plant/tree
#
https://webcache.googleusercontent.com/search?q=cache:91hg5d2649QJ:https://www.britannica.com/plant/tree+&cd=24&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://www.britannica.com/plant/tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAXegQIAhAJ
https://www.knowablemagazine.org/article/living-world/2018/what-makes-tree-tree
#
https://webcache.googleusercontent.com/search?q=cache:AVSszZLtPiQJ:https://www.knowablemagazine.org/article/living-world/2018/what-makes-tree-tree+&cd=25&hl=en&ct=clnk&gl=sk
https://teamtrees.org/
#
https://webcache.googleusercontent.com/search?q=cache:gVbpYoK7meUJ:https://teamtrees.org/+&cd=26&hl=en&ct=clnk&gl=sk
https://www.ldoceonline.com/dictionary/tree
#
https://webcache.googleusercontent.com/search?q=cache:oyS4e3WdMX8J:https://www.ldoceonline.com/dictionary/tree+&cd=27&hl=en&ct=clnk&gl=sk
https://en.wiktionary.org/wiki/tree
#
https://webcache.googleusercontent.com/search?q=cache:s_tZIjpvHZIJ:https://en.wiktionary.org/wiki/tree+&cd=28&hl=en&ct=clnk&gl=sk
/search?hl=en&q=related:https://en.wiktionary.org/wiki/tree+tree&tbo=1&sa=X&ved=2ahUKEwjmroPTuZLqAhVWWs0KHV4oCtsQHzAbegQICBAH
https://www.dictionary.com/browse/tree
#
https://webcache.googleusercontent.com/search?q=cache:EhFIP6m4MuIJ:https://www.dictionary.com/browse/tree+&cd=29&hl=en&ct=clnk&gl=sk
https://www.treepeople.org/tree-benefits
#
https://webcache.googleusercontent.com/search?q=cache:4wLYFp4zTuUJ:https://www.treepeople.org/tree-benefits+&cd=30&hl=en&ct=clnk&gl=sk
EDIT: To filter results you can use this:
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:77.0) Gecko/20100101 Firefox/77.0'}
user_input = input("Enter something to search: ")
print("googling.....")
google_search = requests.get("https://www.google.com/search?hl=en&q="+user_input, headers=headers) # <-- add headers and hl=en parameter
soup = BeautifulSoup(google_search.text , 'html.parser')
search_results = soup.select('.r a')
for link in search_results:
actual_link = link.get('href')
if actual_link.startswith('#') or \
actual_link.startswith('https://webcache.googleusercontent.com') or \
actual_link.startswith('/search?'):
continue
print(actual_link)
Prints (for example):
Enter something to search: tree
googling.....
https://en.wikipedia.org/wiki/Tree
https://simple.wikipedia.org/wiki/Tree
https://www.britannica.com/plant/tree
https://www.knowablemagazine.org/article/living-world/2018/what-makes-tree-tree
https://teamtrees.org/
https://www.ldoceonline.com/dictionary/tree
https://en.wiktionary.org/wiki/tree
https://www.dictionary.com/browse/tree
https://www.treepeople.org/tree-benefits
Most websites nowadays use JavaScript to dynamically load their webpages. Google is one of those websites. In order for the full DOM (document object model) to load in, you need a Javascript engine, which beautifulsoup and requests don't have. Arun recommended selenium, and I do to, as it has an embedded Javascript engine.
Here is the Python Selenium documentation:
https://selenium-python.readthedocs.io/
The OP desired output doesn't come from JavaScript as Serket mentioned. All data that OP needed is located in the HTML.
There's no point in selenium as well for the same reason, it's all there, in the HTML, not rendered via JavaScript.
One of the problems as other people mentioned is because of no user-agent specified AND you possibly passed the wrong user-agent which leads to a completely different HTML that contains an error message or something similar. Check out what is your user-agent.
Pass user-agent:
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get(YOUR_URL, headers=headers)
You can also grab attributes by passing them in square brackets:
element.get('href')
# is equivalent to
element['href']
Code and example in the online IDE (CSS selectors reference):
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "fus ro dah" # query
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
# container with links and iterate over it
for result in soup.select('.tF2Cxc'):
link = result.select_one('.yuRUbf a')['href']
-------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.etsy.com/market/fus_ro_dah
https://www.nexusmods.com/skyrimspecialedition/mods/4889/
https://www.textualtees.com/products/fus-ro-dah-t-shirt
'''
Alternatively, you can achieve the same thing by using Google Search Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't need to figure out why or how to deal with such a problem since this part (extraction/scraping) is already done for the end-user. All that needs to be done is just to iterate over structured JSON and get what you want.
Code:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "fus ro day",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
print(result['link'])
---------
'''
https://elderscrolls.fandom.com/wiki/Unrelenting_Force_(Skyrim)
https://knowyourmeme.com/memes/fus-ro-dah
https://en.uesp.net/wiki/Skyrim:Unrelenting_Force
https://www.etsy.com/market/fus_ro_dah
https://www.urbandictionary.com/define.php?term=Fus%20ro%20dah
https://www.textualtees.com/products/fus-ro-dah-t-shirt
https://tenor.com/search/fus-ro-dah-gifs
'''
P.S - I have a blog post that covers a bit more in-depth how to scrape Google Organic Search Results.
Disclaimer, I work for SerpApi.

can't scrape amazon with beautifulsoup. not a header problem

i would like to scrape amazon top 10 bestsellers in baby-products.
i want just the titel text but it seems that i have a problem.
im getting 'None' when I'm trying this code.
after getting "result" i want to iterate it using "content" and print the titles.
thanks!
import requests
from bs4 import BeautifulSoup
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.71 Safari/537.36'}
url = "https://www.amazon.com/gp/bestsellers/baby-products"
r=requests.get(url, headers=headers)
print("status: ", r.status_code)
soup = BeautifulSoup(r.content, 'html.parser')
print("url: ", r.url)
result = soup.find("ol", {"id": "zg-ordered-list"})
content = result.findAll("div", {"class": "a-section a-spacing-none aok-relative"})
print(result)
print(content)
You won't be able to scrape the Amazon website in this way. You are using requests.get to get the HTTP response body of the url provided. Pay attention to what that response actually is (e.g. by print(r.content)). What you can see in your web browser is different than the raw HTTP response, because of client-side rendering technologies used by Amazon (typically JavaScript and others).
I advice you to use Selenium, which sorts of "emulates" the typical browser inside the Python runtime, renders the site like the normal browser would do and allows you to access properties of the same website you see in your web browser.

Beautiful Soup CSS selector not finding anything

I'm using Python 3. The code below is supposed to let the user enter a search term into the command line, after which it searches Google and runs through the HTML of the results page to find tags matching the CSS selector ('.r a').
Say we search for the term "cats." I know the tags I'm looking for exist on the "cats" search results page since I looked through the page source myself.
But when I run my code, the linkElems list is empty. What is going wrong?
import requests, sys, bs4
print('Googling...')
res = requests.get('http://google.com/search?q=' +' '.join(sys.argv[1:]))
print(res.raise_for_status())
soup = bs4.BeautifulSoup(res.text, 'html5lib')
linkElems = soup.select(".r a")
print(linkElems)
The ".r" class is rendered by Javascript, so it's not available in the HTML received. You can either render the javascript using selenium or similar or you can try a more creative solution to extracting the links from the tags. First check that the tags exist by finding them without the ".r" class. soup.find_all("a") Then as an example you can use regex to extract all urls beginning with "/url?q="
import re
linkelems = soup.find_all(href=re.compile("^/url\?q=.*"))
The parts you want to extract are not rendered by JavaScript as Matts mentioned and you don't need regex for such a task.
Make sure you're using user-agent otherwise Google will block your request eventually. That might be the reason why you were getting an empty output since you received a completely different HTML. Check what is your user-agent. I already answered about what is user-agent and HTTP headers.
Pass user-agent into HTTP headers:
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get("YOUR_URL", headers=headers)
html5lib is the slowest parser, try to use lxml instead, it's way faster. If you want to use even faster parser, have a look at selectolax.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "selena gomez"
}
html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')
for result in soup.select('.tF2Cxc'):
link = result.select_one('.yuRUbf a')['href']
print(link)
----
'''
https://www.instagram.com/selenagomez/
https://www.selenagomez.com/
https://en.wikipedia.org/wiki/Selena_Gomez
https://www.imdb.com/name/nm1411125/
https://www.facebook.com/Selena/
https://www.youtube.com/channel/UCPNxhDvTcytIdvwXWAm43cA
https://www.vogue.com/article/selena-gomez-cover-april-2021
https://open.spotify.com/artist/0C8ZW7ezQVs4URX5aX7Kqx
'''
Alternatively, you can achieve the same thing using Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with the parsing part, instead, you only need to iterate over structured JSON and get the data you want, plus you don't have to maintain the parser over time.
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "selena gomez",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
link = result['link']
print(link)
----
'''
https://www.instagram.com/selenagomez/
https://www.selenagomez.com/
https://en.wikipedia.org/wiki/Selena_Gomez
https://www.imdb.com/name/nm1411125/
https://www.facebook.com/Selena/
https://www.youtube.com/channel/UCPNxhDvTcytIdvwXWAm43cA
https://www.vogue.com/article/selena-gomez-cover-april-2021
https://open.spotify.com/artist/0C8ZW7ezQVs4URX5aX7Kqx
'''
P.S - I wrote a blog post about how to scrape Google Organic Search Results.
Disclaimer, I work for SerpApi.

Categories