Why does BeautifulSoup not get all html on Google?

I've seen similar questions before, but never anything related to Google. When something is searched on Google, all of the links and titles are put in h3 tags. However, if I try to use Beautiful Soup, none of the h3 tags appear, and it seems like a lot of other tags are missing too. I don't think this is a JavaScript issue. Is there anything I'm missing?
link = "http://google.com/search?q=" + input
soup = BeautifulSoup(link, "lxml")
for item in soup.find_all("h3"):
    print(item)
Edit: code

Your code returns an empty result because it never sends a request, for example via the requests module, as the other answers mention. You passed the URL string straight to BeautifulSoup, which parses that string as if it were markup, so there are no h3 tags to find.
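To make this point concrete, here is a minimal offline sketch. The sample HTML string is invented for illustration, and html.parser is used only to avoid extra dependencies. It suggests that BeautifulSoup simply parses whatever text it is handed, so a URL string yields no h3 tags while fetched markup does.

```python
from bs4 import BeautifulSoup

# Passing the URL itself: BeautifulSoup treats the string as markup, not as an address.
link = "http://google.com/search?q=python"
h3_from_url = BeautifulSoup(link, "html.parser").find_all("h3")

# Stand-in for the text a real HTTP response would contain (hypothetical markup).
fetched_html = "<html><body><h3>Result title</h3></body></html>"
h3_from_html = BeautifulSoup(fetched_html, "html.parser").find_all("h3")

print(len(h3_from_url))                     # h3 tags found in the URL string
print([tag.text for tag in h3_from_html])   # h3 text found in the sample markup
```

In a real script the sample string would be replaced by the text of an HTTP response.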
Also, no user-agent header is specified, so Google may eventually block the request. Check what your user-agent is and pass it in the request headers.
Code (CSS selectors reference):
import requests, lxml
from bs4 import BeautifulSoup
headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    "q": "python memes",
    "hl": "en"
}
soup = BeautifulSoup(requests.get('https://www.google.com/search', headers=headers, params=params).text, 'lxml')
# container with all titles and links; iterate over each title and link
for result in soup.select('.yuRUbf'):
    title = result.select_one('.DKV0Md').text
    url = result.a['href']
    print(f'{title}, {url}\n')
---------
'''
35 Funny And Best Python Programming Memes - CodeItBro, https://www.codeitbro.com/funny-python-programming-memes/
ML Memes (#python.memes_) • Instagram photos and videos, https://www.instagram.com/python.memes_/?hl=en
...
'''
Alternatively, you can do the same thing using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The difference is that you only iterate over structured JSON instead of scraping everything from scratch, and you don't have to worry about bypassing blocks from Google or working out how to scrape JavaScript-driven pages such as Google Maps or Google Images, since that is already handled for the end user.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google",
    "q": "python memes",
    "hl": "en"
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results['organic_results']:
    title = result['title']
    url = result['link']
    print(f'{title}, {url}\n')
---------
'''
35 Funny And Best Python Programming Memes - CodeItBro, https://www.codeitbro.com/funny-python-programming-memes/
ML Memes (#python.memes_) • Instagram photos and videos, https://www.instagram.com/python.memes_/?hl=en
...
'''
P.S. - I wrote a more detailed blog post about how to scrape Google Organic Results.
Disclaimer, I work for SerpApi.

Use requests to get the data, and narrow down your search to print only the titles:
from bs4 import BeautifulSoup
import requests
link = "http://google.com/search?q=" + input("Enter search")
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:32.0) Gecko/20100101 Firefox/32.0'}
r = requests.get(link, headers=headers)
soup = BeautifulSoup(r.text,'html5lib')
headings = soup.find_all('h3', class_ = 'LC20lb DKV0Md')
for heading in headings:
    print(heading.text)
Output:
Enter search>? beautifulsoup
Beautiful Soup Documentation — Beautiful Soup 4.9.0 ...
Beautiful Soup: We called him Tortoise because he taught us.
Beautiful Soup documentation - Crummy
beautifulsoup4 · PyPI
Intro to Beautiful Soup | Programming Historian
Beautiful Soup (HTML parser) - Wikipedia
Implementing Web Scraping in Python with BeautifulSoup ...
Tutorial: Web Scraping with Python Using Beautiful Soup
Beautiful Soup - Quick Guide - Tutorialspoint

You first need to fetch the source of the web page with the requests module, and then you can pass it to the BeautifulSoup constructor (also, don't use input as a variable name, since it shadows the built-in function):
import requests
from bs4 import BeautifulSoup
Input = input("Enter search string: ")
link = "http://google.com/search?q=" + Input
html = requests.get(link).content
soup = BeautifulSoup(html, "lxml")
for item in soup.find_all("h3"):
    print(item.text)
You can make the input fail-safe too. In a Google search URL, spaces are written as + and a literal + must be percent-encoded as %2B; urllib.parse.quote_plus handles both:
import requests
from bs4 import BeautifulSoup
from urllib.parse import quote_plus
Input = input("Enter search string: ")
# quote_plus turns spaces into + and escapes reserved characters such as + (as %2B)
link = "http://google.com/search?q=" + quote_plus(Input)
html = requests.get(link).content
soup = BeautifulSoup(html, "lxml")
for item in soup.find_all("h3"):
    print(item.text)
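To see the encoding step in isolation, here is a minimal sketch comparing a hand-rolled chain of str.replace calls with urllib.parse.quote_plus from the standard library. The sample query string is made up for illustration.

```python
from urllib.parse import quote_plus

query = "c++ tips"  # made-up search string containing both a space and plus signs

# Chained replace: the second call also rewrites the + produced by the first one.
chained = query.replace(" ", "+").replace("+", "%20")

# quote_plus: spaces become +, and a literal + is percent-encoded as %2B.
encoded = quote_plus(query)

print(chained)
print(encoded)
```

The chained version loses the distinction between the original plus signs and the space, which is why a dedicated encoder is the safer choice here.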
If you don't have the requests

Related

Missing Classes when scraping Google Pages with Beautiful Soup

I am looking to scrape the audience review score for a specific movie.
I have been trying to scrape this value using BS4, but I cannot find the rating anywhere when scraping the URL. The critic ratings (IMDb, RT, Meta) can be located, but the Google User Score is not there.
I used SerpApi to double-check this, and the value doesn't show up there either.
I am assuming this means that the Google User Score may be populated by some other script and is not retrievable by these means. Could there possibly be another method of retrieving this data?
Looking through the response using Python shows no result either.
html = requests.get('https://www.google.com/search?q=shawshank+redemption&hl=en')
soup = BeautifulSoup(html.text, 'lxml')

Try:
import re
import requests
from bs4 import BeautifulSoup
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:100.0) Gecko/20100101 Firefox/100.0"
}
url = "https://www.google.com/search?q=shawshank+redemption&hl=en"
soup = BeautifulSoup(requests.get(url, headers=headers).content, "html.parser")
# find the first <span> whose text starts with something like "96% liked"
liked = soup.find(
    lambda tag: tag.name == "span" and re.match(r"\d+% liked", tag.text)
)
# find() returns None when no matching span is present in the response
print(liked.text.split()[0] if liked else None)
Prints:
96%

Python: scraping google results for websites' main URL and title

I am trying to scrape a given number of results from a Google search, but so far I have come across two problems. One is that I don't know how to join the URLs and the titles inside the same loop, so they can be shown together in the format:
(Title)
(Website URL)
(---------)
(Title)
(Website URL)
(---------)
I somehow managed to achieve this format, but the loop is going on several times, instead of just showing the top 10 results. I believe it's something to do with how I structured the loops to work together, but I don't know how to avoid this.
The other problem is that I want to obtain both main URL and title of each website within search results, but while I managed to get the right titles, I seem to be getting many links coming from the same website, instead of only the main URL. For instance, if I search for "data science", the second or third title shown is from Coursera, while the link is from wikipedia. I only want the main URL so the title matches the website URL, how do I get it?
Any input will be greatly appreciated
import requests
from bs4 import BeautifulSoup
import re
query = "data science"
search = query.replace(' ', '+')
results = 10
url = (f"https://www.google.com/search?q={search}&num={results}")
requests_results = requests.get(url)
soup_link = BeautifulSoup(requests_results.content, "html.parser")
soup_title = BeautifulSoup(requests_results.text,"html.parser")
links = soup_link.find_all("a")
heading_object=soup_title.find_all( 'h3' )
for link in links:
    for info in heading_object:
        get_title = info.getText()
        link_href = link.get('href')
        if "url?q=" in link_href and not "webcache" in link_href:
            print(get_title)
            print(link.get('href').split("?q=")[1].split("&sa=U")[0])
            print("------")

The length of your links doesn't seem to match your heading_object list. I think it's best if you filter it further than just "a".
Editing your solution, you can loop through links like this:
import requests
from bs4 import BeautifulSoup
import re
query = "data science"
search = query.replace(' ', '+')
results = 10
url = (f"https://www.google.com/search?q={search}&num={results}")
requests_results = requests.get(url)
soup_link = BeautifulSoup(requests_results.content, "html.parser")
links = soup_link.find_all("a")
for link in links:
    link_href = link.get('href')
    if "url?q=" in link_href and not "webcache" in link_href:
        title = link.find_all('h3')
        if len(title) > 0:
            print(link.get('href').split("?q=")[1].split("&sa=U")[0])
            print(title[0].getText())
            print("------")
Instead of keeping two lists for headings and links, we can get the heading directly from the link by calling find_all('h3') on the link object itself.
Some links match the url?q= format but are not part of the actual results you want to display, such as the expanding accordion for related searches, so we need to filter those out too. We do that by checking whether the link contains an h3 heading, which is why the code tests len(title) > 0.
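The split-based URL extraction used above can also be sketched with the standard library's urllib.parse, which may be a little more robust if the parameter order changes. This is only an illustration: the helper name extract_target and the sample href are hypothetical, and it assumes the href has the /url?q=...&sa=U form handled by the code above.

```python
from urllib.parse import urlparse, parse_qs

def extract_target(href):
    """Return the q= value of a Google redirect href, or None if it has none."""
    query = parse_qs(urlparse(href).query)
    values = query.get("q")
    return values[0] if values else None

# Hypothetical href in the /url?q=...&sa=U form.
sample = "/url?q=https://en.wikipedia.org/wiki/Data_science&sa=U&ved=abc"
print(extract_target(sample))
print(extract_target("/search?hl=en"))  # no q parameter, so None
```

Note that parse_qs also percent-decodes the value, which the split-based version does not do.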

Try passing the query parameters to requests as a dict; it's more readable, e.g.:
params = {
    "q": "fus ro dah",
    "hl": "en",
    "gl": "us",
    "num": "100"
}
requests.get('https://www.google.com/search', params=params)
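For a rough idea of what such a params dict becomes in the URL, this sketch uses the standard library's urlencode as a stand-in. requests does its own encoding internally, so treat the result as an approximation of the final URL rather than a guarantee.

```python
from urllib.parse import urlencode

params = {
    "q": "fus ro dah",
    "hl": "en",
    "gl": "us",
    "num": "100",
}

# urlencode joins the key=value pairs with & and encodes spaces as +.
query_string = urlencode(params)
url = "https://www.google.com/search?" + query_string
print(url)
```

Keeping the parameters in a dict means none of this string handling has to be written by hand.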
Make sure you pass a user-agent in the request headers so the request looks like a real user visit. Otherwise Google will eventually block your request, because the default requests user-agent is python-requests. Check what your user-agent is.
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
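The effect of the header can be inspected without sending anything. The sketch below prints the user-agent that requests would use by default and then prepares, but does not send, a request with a custom one. The custom user-agent value is a made-up placeholder.

```python
import requests

# User-agent that requests sends when none is supplied.
default_ua = requests.utils.default_user_agent()

# Build (but do not send) a request with a custom header to inspect what would go out.
custom_ua = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"  # hypothetical value
session = requests.Session()
request = requests.Request(
    "GET",
    "https://www.google.com/search",
    params={"q": "data science"},
    headers={"User-Agent": custom_ua},
)
prepared = session.prepare_request(request)

print(default_ua)
print(prepared.headers["User-Agent"])
```

Preparing the request this way is only for inspection; passing headers=headers to requests.get has the same effect on what is sent.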
You don't need to create several soups (BeautifulSoup() objects); create one and reuse it wherever it's needed. CSS selectors reference.
soup = BeautifulSoup(html.text, 'YOUR PARSER OF CHOICE') # try 'lxml', it's one of the fastest
# call it
soup.select()
soup.find_all()
soup.a.parent
soup.p.next_element
for i in soup.select('css_selector'):
    some_variable = i.select_one('css_selector')
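As an offline illustration of the container pattern, here is a small sketch that builds one soup and pairs each title with its link. The markup and class names are invented for the example, and html.parser is used only to avoid extra dependencies.

```python
from bs4 import BeautifulSoup

# Hypothetical markup; the class names are invented for this example.
html = """
<div class="result"><h3 class="title">First</h3><a href="https://example.com/1">link</a></div>
<div class="result"><h3 class="title">Second</h3><a href="https://example.com/2">link</a></div>
"""

soup = BeautifulSoup(html, "html.parser")  # one soup, reused for every query

rows = []
for result in soup.select(".result"):          # select() returns every match
    title = result.select_one(".title").text   # select_one() returns the first match or None
    link = result.select_one("a")["href"]
    rows.append((title, link))

print(rows)
```

Because the title and link are both looked up inside the same container, they cannot drift out of step the way two separate lists can.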
Code and full example in the online IDE:
import requests, lxml
from bs4 import BeautifulSoup
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
    'q': 'data science',
    'hl': 'en',
    'num': '100'
}
html = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(html, 'lxml')
# container with all needed data
for result in soup.select('.tF2Cxc'):
    title = result.select_one('.DKV0Md').text
    link = result.select_one('.yuRUbf a')['href']
    displayed_link = result.select_one('.TbwUpd.NJjxre').text
    try:
        snippet = result.select_one('#rso .lyLwlc').text
    except AttributeError:
        # select_one() returned None: this result has no snippet
        snippet = None
    print(f'{title}\n{link}\n{displayed_link}\n{snippet}\n')
    print('---------------')
'''
Data Science Specialization - Coursera
https://www.coursera.org/specializations/jhu-data-science
https://www.coursera.org › ... › Data Analysis
Offered by Johns Hopkins University. Launch Your Career in Data Science. A ten-course introduction to data science, developed and taught by .
---------------
'''
Alternatively, you can do the same thing using the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The main difference is that you only need to iterate over structured JSON and pick out the data you want. You don't have to figure out how to select and extract certain elements, bypass Google blocks if they appear, or deal with JavaScript-driven websites such as Google Maps.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "api_key": os.getenv("API_KEY"), # serpapi API key
    "engine": "google", # search engine
    "q": "data science", # search query
    "hl": "en" # language of the search
}
search = GoogleSearch(params) # where data extraction happens
results = search.get_dict() # JSON -> Python dictionary
for result in results['organic_results']:
    title = result['title']
    link = result['link']
    displayed_link = result['displayed_link']
    snippet = result['snippet']
    print(f"{title}\n{link}\n{displayed_link}\n{snippet}\n")
    print('---------------')
'''
Data science - Wikipedia
https://en.wikipedia.org/wiki/Data_science
https://en.wikipedia.org › wiki › Data_science
Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured ...
---------------
'''
Disclaimer, I work for SerpApi.

Beautifulsoup is returning double links

I am trying to learn how to scrape websites and am therefore not using an API. I am trying to scrape eBay's website, and my script prints each URL twice. I did my due diligence and searched Google/Stack Overflow for help but was unable to find a solution. Thanks in advance.
driver.get('https://www.ebay.com/sch/i.html?_from=R40&_nkw=watches&_sacat=0&_pgn=' + str(i))
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.maximize_window()
tempList = []
for link in soup.find_all('a', href=True):
    if 'itm' in link['href']:
        print(link['href'])
        tempList.append(link['href'])
Entire code: https://pastebin.com/q41eh3Q6

Just add the class name when searching for the links. Hope this helps.
i=1
driver.get('https://www.ebay.com/sch/i.html?_from=R40&_nkw=watches&_sacat=0&_pgn=' + str(i))
soup = BeautifulSoup(driver.page_source, 'lxml')
driver.maximize_window()
tempList = []
for link in soup.find_all('a',class_='s-item__link', href=True):
    if 'itm' in link['href']:
        print(link['href'])
        tempList.append(link['href'])
print(len(tempList))
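If filtering by class is not an option, a different technique (not the one used in the answer above) is to de-duplicate the collected hrefs after the fact. A minimal sketch, with a made-up list standing in for the scraped links:

```python
# Hypothetical list of scraped hrefs in which each item URL appears twice.
links = [
    "https://www.ebay.com/itm/1",
    "https://www.ebay.com/itm/1",
    "https://www.ebay.com/itm/2",
    "https://www.ebay.com/itm/2",
]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats without reordering the results.
unique_links = list(dict.fromkeys(links))
print(unique_links)
```

Unlike converting to a set, this keeps the links in the order they appeared on the page.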

You're looking for this:
# container with needed data: title, link, price, condition, number of reviews, etc.
for item in soup.select('.s-item__wrapper.clearfix'):
    # only link will be extracted from the container
    link = item.select_one('.s-item__link')['href']
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.ebay.com/sch/i.html?_nkw=watches', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
temp_list = []
for item in soup.select('.s-item__wrapper.clearfix'):
    link = item.select_one('.s-item__link')['href']
    temp_list.append(link)
    print(link)
------------
'''
https://www.ebay.com/itm/203611966827?hash=item2f68380d6b:g:pBAAAOSw1~NhRy4Y
https://www.ebay.com/itm/133887696438?hash=item1f2c541e36:g:U3IAAOSwBKthN4yg
https://www.ebay.com/itm/154561925393?epid=26004285120&hash=item23fc9bd111:g:TWUAAOSwf3pgNP08
https://www.ebay.com/itm/115010872425?hash=item1ac72ea469:g:yQsAAOSweMBhT4gs
https://www.ebay.com/itm/115005461839?epid=1776383383&hash=item1ac6dc154f:g:QskAAOSwDe9hS7Ys
https://www.ebay.com/itm/224515689673?hash=item34462d8cc9:g:oTwAAOSwAO5gna8u
https://www.ebay.com/itm/124919898822?hash=item1d15ce62c6:g:iEoAAOSwhAthQnX9
https://www.ebay.com/itm/133886767671?hash=item1f2c45f237:g:htkAAOSwNAhhQOyf
https://www.ebay.com/itm/115005341920?hash=item1ac6da40e0:g:4SIAAOSwWi1hR5Mx
...
'''
Alternatively, you can achieve the same thing by using the eBay Organic Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to deal with the extraction process or maintain it over time. Instead, you only need to iterate over structured JSON and get the data you want.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
    "engine": "ebay",
    "ebay_domain": "ebay.com",
    "_nkw": "watches",
    "api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
temp_list = []
for result in results['organic_results']:
    link = result['link']
    temp_list.append(link)
    print(link)
------------
'''
https://www.ebay.com/itm/203611966827?hash=item2f68380d6b:g:pBAAAOSw1~NhRy4Y
https://www.ebay.com/itm/133887696438?hash=item1f2c541e36:g:U3IAAOSwBKthN4yg
https://www.ebay.com/itm/154561925393?epid=26004285120&hash=item23fc9bd111:g:TWUAAOSwf3pgNP08
https://www.ebay.com/itm/115010872425?hash=item1ac72ea469:g:yQsAAOSweMBhT4gs
https://www.ebay.com/itm/115005461839?epid=1776383383&hash=item1ac6dc154f:g:QskAAOSwDe9hS7Ys
https://www.ebay.com/itm/224515689673?hash=item34462d8cc9:g:oTwAAOSwAO5gna8u
https://www.ebay.com/itm/124919898822?hash=item1d15ce62c6:g:iEoAAOSwhAthQnX9
https://www.ebay.com/itm/133886767671?hash=item1f2c45f237:g:htkAAOSwNAhhQOyf
https://www.ebay.com/itm/115005341920?hash=item1ac6da40e0:g:4SIAAOSwWi1hR5Mx
...
'''
P.S - I wrote a bit more in-depth blog post about how to scrape eBay search with Python.
Disclaimer, I work for SerpApi.

Webscraping Using BeautifulSoup: Retrieving source code of a website

Good day!
I am currently making a web scraper for the Alibaba website.
My problem is that the returned source code does not show some parts that I am interested in. The data is there when I checked the source code using the browser, but I can't retrieve it when using BeautifulSoup.
Any tips?
from urllib.request import urlopen  # urllib2.urlopen on Python 2
from bs4 import BeautifulSoup
def make_soup(url):
    try:
        html = urlopen(url).read()
    except:
        return None
    return BeautifulSoup(html, "lxml")
url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup2 = make_soup(url)
I am interested in the highlighted part as shown in the image using the Developer Tools of Chrome. But when I tried writing in a text file, some parts including the highlighted is nowhere to be found. Any tips? TIA!

You need to provide the User-Agent header at least.
Example using requests package instead of urllib2:
import requests
from bs4 import BeautifulSoup
def make_soup(url):
    try:
        html = requests.get(url, headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.80 Safari/537.36"}).content
    except requests.RequestException:
        # network or HTTP-level failure: signal it to the caller with None
        return None
    return BeautifulSoup(html, "lxml")
url = "http://www.alibaba.com/Agricultural-Growing-Media_pid144"
soup = make_soup(url)
print(soup.select_one("a.next").get('href'))
Prints http://www.alibaba.com/catalogs/products/CID144/2.

Why am I getting repetitive output while trying to scrape data from Google Scholar?

I am trying to scrape the PDF links from the search results from Google Scholar. I have tried to set a page counter based on the change in URL, but after the first eight output links, I am getting repetitive links as output.
#!/usr/bin/env python
from mechanize import Browser
from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup
import urllib2
import requests
#modifying the url as per page
urlCounter = 0
while urlCounter <=30:
    urlPart1 = "http://scholar.google.com/scholar?start="
    urlPart2 = "&q=%22entity+resolution%22&hl=en&as_sdt=0,4"
    url = urlPart1 + str(urlCounter) + urlPart2
    page = urllib2.Request(url,None,{"User-Agent":"Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"})
    resp = urllib2.urlopen(page)
    html = resp.read()
    soup = BeautifulSoup(html)
    urlCounter = urlCounter + 10
    recordCount = 0
    while recordCount <=9:
        recordPart1 = "gs_ggsW"
        finRecord = recordPart1 + str(recordCount)
        recordCount = recordCount+1
        #printing the links
        for link in soup.find_all('div', id = finRecord):
            linkstring = str(link)
            soup1 = BeautifulSoup(linkstring)
        for link in soup1.find_all('a'):
            print(link.get('href'))

Change the following line in your code:
finRecord = recordPart1 + str(recordCount)
To
finRecord = recordPart1 + str(recordCount+urlCounter-10)
The real problem: the div ids on the first page are gs_ggsW[0-9], but the ids on the second page are gs_ggsW[10-19], so Beautiful Soup finds no matching divs on the second page.
Python's variable scope may confuse people coming from other languages, like Java. A variable assigned inside a for loop still exists after the loop finishes. So when nothing matches on the second page, soup1 still holds the last record from the first page, and the loop below prints its links again.
for link in soup1.find_all('a'):
    print(link.get('href'))
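The scoping behavior can be shown without any scraping. In this minimal sketch (the data is made up), a name assigned inside a for loop keeps its old value when a later pass iterates over nothing, which mirrors the repeated output described above.

```python
# Hypothetical per-page matches: page 2 has no matching elements.
pages = [["a.pdf", "b.pdf"], []]

printed = []
for matches in pages:
    for item in matches:
        last_seen = item        # assigned only when the loop body runs
    # last_seen is still in scope here; on page 2 it holds page 1's value
    printed.append(last_seen)

print(printed)
```

Resetting the variable at the top of each outer iteration, or doing the work inside the inner loop, avoids carrying stale values across pages.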
Updates:
Google may not provide PDF download links for some papers, so you can't rely on the id to match each paper's link. You can use CSS selectors to match all the links at once.
soup = BeautifulSoup(html)
urlCounter = urlCounter + 10
for link in soup.select('div.gs_ttss a'):
    print(link.get('href'))
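An offline sketch of the same idea follows. The markup is hypothetical and heavily simplified: only the gs_ttss class comes from the answer above, and the gs_r wrapper and URLs are invented. It shows a single selector collecting every link even when one result has no PDF.

```python
from bs4 import BeautifulSoup

# Hypothetical, simplified markup; only the gs_ttss class is taken from the answer.
html = """
<div class="gs_r"><div class="gs_ttss"><a href="http://example.org/a.pdf">[PDF]</a></div></div>
<div class="gs_r"><h3>Paper without a PDF link</h3></div>
<div class="gs_r"><div class="gs_ttss"><a href="http://example.org/c.pdf">[PDF]</a></div></div>
"""

soup = BeautifulSoup(html, "html.parser")

# One selector collects every link, regardless of per-result ids or gaps.
pdf_links = [a.get("href") for a in soup.select("div.gs_ttss a")]
print(pdf_links)
```

Since no ids are involved, the result with no PDF is simply skipped rather than shifting the numbering.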

Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser.
Code and example in the online IDE to extract PDFs:
from bs4 import BeautifulSoup
import requests, lxml
params = {
    "q": "entity resolution", # search query
    "hl": "en" # language
}
# https://requests.readthedocs.io/en/master/user/quickstart/#custom-headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582",
}
html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
soup = BeautifulSoup(html.text, "lxml")
for pdf_link in soup.select(".gs_or_ggsm a"):
    pdf_file_link = pdf_link["href"]
    print(pdf_file_link)
# output from the first page:
'''
https://linqs.github.io/linqs-website/assets/resources/getoor-vldb12-slides.pdf
http://ilpubs.stanford.edu:8090/859/1/2008-7.pdf
https://drum.lib.umd.edu/bitstream/handle/1903/4241/umi-umd-4070.pdf;sequence=1
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.169.9535&rep=rep1&type=pdf
https://arxiv.org/pdf/1208.1927
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.6875&rep=rep1&type=pdf
http://da.qcri.org/ntang/pubs/vldb18-deeper.pdf
'''
Alternatively, you can achieve the same thing by using Google Scholar Organic Results API from SerpApi. It's a paid API with a free plan.
The main difference is that you only need to grab the data from structured JSON instead of figuring out how to extract it from HTML and how to bypass blocks from search engines.
Code to integrate:
from serpapi import GoogleSearch
params = {
    "api_key": "YOUR_API_KEY", # SerpApi API key
    "engine": "google_scholar", # Google Scholar organic results
    "q": "entity resolution", # search query
    "hl": "en" # language
}
search = GoogleSearch(params)
results = search.get_dict()
for pdfs in results["organic_results"]:
    for link in pdfs.get("resources", []):
        pdf_link = link["link"]
        print(pdf_link)
# output:
'''
https://linqs.github.io/linqs-website/assets/resources/getoor-vldb12-slides.pdf
http://ilpubs.stanford.edu:8090/859/1/2008-7.pdf
https://drum.lib.umd.edu/bitstream/handle/1903/4241/umi-umd-4070.pdf;sequence=1
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.169.9535&rep=rep1&type=pdf
https://arxiv.org/pdf/1208.1927
https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.77.6875&rep=rep1&type=pdf
http://da.qcri.org/ntang/pubs/vldb18-deeper.pdf
'''
If you want to scrape more data from organic results, there's a dedicated Scrape Google Scholar with Python blog post of mine.
Disclaimer, I work for SerpApi.
