I am scraping Google Scholar and have trouble getting the right XPath expression. When I inspect the desired elements, I get expressions like these:
//*[@id="gs_res_ccl_mid"]/div[2]/div[2]/div[3]/a[3]
//*[@id="gs_res_ccl_mid"]/div[3]/div/div[3]/a[3]
//*[@id="gs_res_ccl_mid"]/div[6]/div[2]/div[3]/a[3]
I ended up with the generic expression:
//*[@id="gs_res_ccl_mid"]//a[3]
I also tried this alternative, with similar results:
//*[@id="gs_res_ccl_mid"]/div*/div*/div*/a[3]
The output is something like this (I cannot post the entire result set because I don't have 10 reputation points):
[
'https://scholar.google.es/scholar?cites=5812018205123467454&as_sdt=2005&sciodt=0,5&hl=es',
'https://scholar.google.es/citations?user=EOc3O8AAAAAJ&hl=es&oi=sra',
'https://scholar.google.es/citations?user=nd8O1XQAAAAJ&hl=es&oi=sra',
'https://scholar.google.es/scholar?cites=15483392402856138853&as_sdt=2005&sciodt=0,5&hl=es',
'https://scholar.google.es/scholar?cites=7733120668292842687&as_sdt=2005&sciodt=0,5&hl=es',
'https://scholar.google.es/scholar?cites=15761030700327980189&as_sdt=2005&sciodt=0,5&hl=es'
]
The problem with the output is that there are 3 extra, unwanted elements, and they all contain the text citations?user. What can I do to get rid of the unwanted elements?
My code:
def paperOthers(exp, atr=None):
    thread = browser.find_elements(By.XPATH, (" %s" % exp))
    xArray = []
    for t in thread:
        if atr == 0:
            xThread = t.get_attribute('id')
        elif atr == 1:
            xThread = t.get_attribute('href')
        else:
            xThread = t.text
        xArray.append(xThread)
    return xArray
Which I call with:
rcites = paperOthers("//*[#id='gs_res_ccl_mid']//a[3]", 1)
Change the XPath to exclude the links whose href contains that text:
rcites = paperOthers("//*[@id='gs_res_ccl_mid']//a[3][not(contains(@href, 'citations?user'))]", 1)
The XPath expression could be as simple as //*[@class="gs_fl"]/a[3]/@href:
//* selects all elements in the document, which the following predicate then narrows down.
[@class="gs_fl"] selects the element nodes whose class attribute is gs_fl.
/a[3] selects the third <a> element that is a child of the gs_fl element.
/@href selects the href attribute of that <a> element.
A w3schools XPath syntax reminder.
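If you prefer to stay with Selenium from the question rather than parsel, here is a minimal sketch under that assumption (it reuses the question's browser object; Selenium's XPath must select elements, not attribute nodes, so the href is read with get_attribute):
from selenium.webdriver.common.by import By

# Hypothetical Selenium variant of the same expression: pick the third <a> in each gs_fl block,
# then read its href attribute in Python.
links = browser.find_elements(By.XPATH, '//*[@class="gs_fl"]/a[3]')
cited_by_links = [a.get_attribute('href') for a in links]
print(cited_by_links)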
Code and full example in the online IDE:
from parsel import Selector
import requests

# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
    "q": "biology",  # search query
    "hl": "en"       # language
}

# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
# https://www.whatismybrowser.com/detect/what-is-my-user-agent
# used to act as a "real" user visit
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}

html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
selector = Selector(html.text)

for cite_by in selector.xpath('//*[@class="gs_fl"]/a[3]/@href'):
    cited_by_link = f"https://scholar.google.com/{cite_by.get()}"
    print(cited_by_link)
# output:
"""
https://scholar.google.com//scholar?cites=775353062728716840&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=1275980731835430123&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=9861875288567469852&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=6048612362870884073&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=9716378516521733998&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=12429039222112550214&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=12009957625147018103&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=11605101213592406305&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=85936656034523965&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=3694569986105898338&as_sdt=2005&sciodt=0,5&hl=en
"""
Alternatively, you can achieve the same thing by using Google Scholar Organic Results API from SerpApi.
It's a paid API with a free plan. It saves you from having to figure out how to scrape the data and maintain the scraper over time, how to scale it without getting blocked by the search engine, and how to find reliable proxy providers or CAPTCHA-solving services.
Example code to integrate:
from serpapi import GoogleScholarSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),  # SerpApi API key
    "engine": "google_scholar",       # scraping search engine
    "q": "biology",                   # search query
    "hl": "en"                        # language
}

search = GoogleScholarSearch(params)
results = search.get_dict()

for cited_by in results["organic_results"]:
    cited_by_link = cited_by["inline_links"]["cited_by"]["link"]
    print(cited_by_link)
# output:
"""
https://scholar.google.com/scholar?cites=775353062728716840&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=1275980731835430123&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=9861875288567469852&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=6048612362870884073&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=9716378516521733998&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=12429039222112550214&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=12009957625147018103&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=11605101213592406305&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=85936656034523965&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=3694569986105898338&as_sdt=2005&sciodt=0,5&hl=en
"""
Disclaimer, I work for SerpApi.
I'm trying to scrape Google for related searches when given a list of keywords, and then output these related searches into a CSV file. My problem is getting Beautiful Soup to identify the related-searches HTML tags.
Here is an example html tag in the source code:
<div data-ved="2ahUKEwitr8CPkLT3AhVRVsAKHVF-C80QmoICKAV6BAgEEBE">iphone xr</div>
Here are my webdriver settings:
from selenium import webdriver
user_agent = 'Chrome/100.0.4896.60'
webdriver_options = webdriver.ChromeOptions()
webdriver_options.add_argument('user-agent={0}'.format(user_agent))
capabilities = webdriver_options.to_capabilities()
capabilities["acceptSslCerts"] = True
capabilities["acceptInsecureCerts"] = True
Here is my code as is:
queries = ["iphone"]
driver = webdriver.Chrome(options=webdriver_options, desired_capabilities=capabilities, port=4444)
df2 = []
driver.get("https://google.com")
time.sleep(3)
driver.find_element(By.CSS_SELECTOR, "[aria-label='Agree to the use of cookies and other data for the purposes described']").click()
# get_current_related_searches
for query in queries:
driver.get("https://google.com/search?q=" + query)
time.sleep(3)
soup = BeautifulSoup(driver.page_source, 'html.parser')
p = soup.find_all('div data-ved')
print(p)
d = pd.DataFrame({'loop': 1, 'source': query, 'from': query, 'to': [s.text for s in p]})
terms = d["to"]
df2.append(d)
time.sleep(3)
df = pd.concat(df2).reset_index(drop=False)
df.to_csv("related_searches.csv")
It's the p = soup.find_all call that is incorrect; I'm just not sure how to get Beautiful Soup to identify these specific HTML tags. Any help would be great :)
@jakecohensol, as you've pointed out, the selector in p = soup.find_all is wrong. The correct CSS selector: .y6Uyqe .AB4Wff.
The Chrome/100.0.4896.60 User-Agent header is incorrect; Google blocks requests with such an agent string. With a full User-Agent string, Google returns a proper HTML response.
Google Related Searches can be scraped without a browser. It will be faster and more reliable.
Here's your fixed code snippet (link to the full code in the online IDE):
import time
import requests
from bs4 import BeautifulSoup
import pandas as pd

headers = {
    "User-Agent": "Mozilla/5.0 (X11; CrOS x86_64 14526.89.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36"
}

queries = ["iphone", "pixel", "samsung"]
df2 = []

# get_current_related_searches
for query in queries:
    params = {"q": query}
    response = requests.get("https://google.com/search", params=params, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")

    p = soup.select(".y6Uyqe .AB4Wff")

    d = pd.DataFrame(
        {"loop": 1, "source": query, "from": query, "to": [s.text for s in p]}
    )
    terms = d["to"]
    df2.append(d)
    time.sleep(3)

df = pd.concat(df2).reset_index(drop=False)
df.to_csv("related_searches.csv")
Sample output:
,index,loop,source,from,to
0,0,1,iphone,iphone,iphone 13
1,1,1,iphone,iphone,iphone 12
2,2,1,iphone,iphone,iphone x
3,3,1,iphone,iphone,iphone 8
4,4,1,iphone,iphone,iphone 7
5,5,1,iphone,iphone,iphone xr
6,6,1,iphone,iphone,find my iphone
7,0,1,pixel,pixel,pixel 6
8,1,1,pixel,pixel,google pixel
9,2,1,pixel,pixel,pixel phone
10,3,1,pixel,pixel,pixel 6 pro
11,4,1,pixel,pixel,pixel 3
12,5,1,pixel,pixel,google pixel price
13,6,1,pixel,pixel,pixel 6 release date
14,0,1,samsung,samsung,samsung galaxy
15,1,1,samsung,samsung,samsung tv
16,2,1,samsung,samsung,samsung tablet
17,3,1,samsung,samsung,samsung account
18,4,1,samsung,samsung,samsung mobile
19,5,1,samsung,samsung,samsung store
20,6,1,samsung,samsung,samsung a21s
21,7,1,samsung,samsung,samsung login
Have a look at the SelectorGadget Chrome extension, which lets you grab a CSS selector by clicking on the desired element in your browser.
Check what your user agent is, or collect multiple user agents for mobile, tablet, PC, or different operating systems so you can rotate them, which reduces the chance of being blocked a little bit.
The ideal scenario is to combine rotating user agents with rotating proxies (ideally residential) and a CAPTCHA solver to handle the Google CAPTCHA that will appear eventually; a minimal rotation sketch is shown below.
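As a rough sketch of user-agent rotation (the user-agent strings and the commented-out proxy address are placeholders, not values from the original answer), picking a random header per request could look like this:
import random
import requests

# Hypothetical pool of user agents; replace with real, current strings.
user_agents = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36",
    "Mozilla/5.0 (X11; CrOS x86_64 14526.89.0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.133 Safari/537.36",
]

headers = {"User-Agent": random.choice(user_agents)}

# Optional: route the request through a (hypothetical) residential proxy.
# proxies = {"https": "http://user:pass@proxy.example.com:8080"}

response = requests.get("https://google.com/search", params={"q": "iphone"}, headers=headers)
print(response.status_code)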
As an alternative, there's a Google Search Engine Results API to scrape Google search results if you don't want to figure out how to create and maintain the parser from scratch, or how to bypass blocks from Google (or other search engines).
Example code to integrate:
import os
import pandas as pd
from serpapi import GoogleSearch

queries = [
    'banana',
    'minecraft',
    'apple stock',
    'how to create a apple pie'
]

def serpapi_scrape_related_queries():
    related_searches = []

    for query in queries:
        print(f'extracting related queries from query: {query}')

        params = {
            'api_key': os.getenv('API_KEY'),  # your serpapi api key
            'device': 'desktop',              # device to retrieve results from
            'engine': 'google',               # serpapi parsing engine
            'q': query,                       # search query
            'gl': 'us',                       # country of the search
            'hl': 'en'                        # language of the search
        }

        search = GoogleSearch(params)  # where data extraction happens on the backend
        results = search.get_dict()   # JSON -> dict

        for result in results['related_searches']:
            query = result['query']
            link = result['link']

            related_searches.append({
                'query': query,
                'link': link
            })

    pd.DataFrame(data=related_searches).to_csv('serpapi_related_queries.csv', index=False)

serpapi_scrape_related_queries()
Part of the dataframe output:
query link
0 banana benefits https://www.google.com/search?gl=us&hl=en&q=Ba...
1 banana republic https://www.google.com/search?gl=us&hl=en&q=Ba...
2 banana tree https://www.google.com/search?gl=us&hl=en&q=Ba...
3 banana meaning https://www.google.com/search?gl=us&hl=en&q=Ba...
4 banana plant https://www.google.com/search?gl=us&hl=en&q=Ba...
I'm writing a program that will scrape wind speed and direction data from Google. I've seen other results online where it works out fine, but for some reason, it's not working out for me. I am specifically interested in scraping the elements with "img" tags. Here is my code:
import requests
import bs4
import geocoder
lat, long = 40.776903698619975, -74.45007646247723
base_url = r"https://www.google.com/search?q="
geoc = geocoder.osm([lat, long], method='reverse').json["raw"]["address"]
search_query = geoc["state"] + " " + geoc["country"] + " wind conditions"
lowest_admin_levels = ("municipality", "town", "city", "county")
level_found = False
for level in lowest_admin_levels:
try:
search_query = geoc[level] + " " + search_query
level_found = True
break
except KeyError:
continue
url = base_url + search_query.replace(" ", "+")
print(url)
page = requests.get(url)
soup = bs4.BeautifulSoup(page.content, 'html.parser')
print(soup.find_all('img'))
The lat/long variables could be any coordinates; those are just examples. soup.find_all('img') returns just one "img" element, when in reality the page has multiple "img"s containing arrows rotated according to the wind direction, which you can see at this link: https://www.google.com/search?q=Morris+Township+New+Jersey+United+States+wind+conditions. Thank you!
As the comment already says, Google loads the images dynamically using JavaScript. The requests library and Beautiful Soup are not able to get those JavaScript-loaded images. That's why you need Selenium to get them.
Installation
pip install selenium
pip install webdriver-manager
Solution
import geocoder

# New imports
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options

lat, long = 40.776903698619975, -74.45007646247723

BASE_URL = r"https://www.google.com/search?q="
geoc = geocoder.osm([lat, long], method='reverse').json["raw"]["address"]

search_query = geoc["state"] + " " + geoc["country"] + " wind conditions"

lowest_admin_levels = ("municipality", "town", "city", "county")
for level in lowest_admin_levels:
    try:
        search_query = geoc[level] + " " + search_query
        break
    except KeyError:
        continue

url = BASE_URL + search_query.replace(" ", "+")

chrome_options = Options()
# The options make the browser headless, so you don't see it.
# Comment out those two lines to see what's happening.
chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=1920x1080")

driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options)  # You could specify the path to the chrome driver instead
driver.get(url)
time.sleep(2)

imgs = driver.find_elements_by_tag_name('img')  # all the image tags
for img in imgs:
    image_source = img.get_attribute('src')  # the src of the img tag
    print(image_source)
When you remove the headless option, you will see what Selenium "sees". Using Selenium, you can also click around on the website and interact with it, just as a normal user would; a small sketch follows.
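For instance, a minimal interaction sketch could look like the following (it reuses the driver from the answer above; the CSS selector for the consent button is a placeholder, not taken from Google's real markup):
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

# Hypothetical consent button; the selector is only illustrative.
driver.find_element(By.CSS_SELECTOR, "button#consent-accept").click()

# Type a new query into the search box and submit it.
search_box = driver.find_element(By.NAME, "q")
search_box.clear()
search_box.send_keys("london wind conditions", Keys.ENTER)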
It doesn't require geocoder or Selenium. Check out the SelectorGadget Chrome extension to visually grab CSS selectors by clicking on the desired element.
Also, you can get the wind direction from the same element, e.g. class='wob_t' -> aria-label:
<span class="wob_t" style="display:inline;text-align:right" aria-label="8 km/h From northwest Tuesday 10:00">8 km/h</span>
Which is the same as in the <img> element (look at alt):
<img src="//ssl.gstatic.com/m/images/weather/wind_unselected.svg" alt="8 km/h From northwest" style="transform-origin:50% 50%;transform:rotate(408deg);width:16px" aria-hidden="true" data-atf="1" data-frt="0" class="">
Code and full example that scrapes more in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "london weather",
    "hl": "en",
    "gl": "us"
}

response = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(response, 'lxml')

for weather_result in soup.select('.wob_noe .wob_hw'):
    try:
        wind_speed = weather_result.select_one('.wob_t').text
        '''
        Extracts the element's aria-label, splits the string by a SPACE, grabs the items at
        indexes 2 and 3, and joins them with a SPACE. Or just use a regex instead.
        Example:
        7 mph From northwest Sunday 9:00 AM ---> From northwest
        '''
        wind_direction = ' '.join(weather_result.select_one('.wob_t')['aria-label'].split(' ')[2:4])

        print(f"{wind_speed}\n{wind_direction}\n")
    except:
        pass  # or None instead
----------
'''
8 mph
From northeast
11 mph
From east
9 mph
From northeast
...
'''
Alternatively, you can use Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
Essentially, you don't need to figure out the extraction part of the process; all that really needs to be done is to iterate over structured JSON and use whatever you need from it (and you don't have to bypass blocks from Google or maintain the parser over time).
Code to integrate:
from serpapi import GoogleSearch
import os, json

params = {
    "engine": "google",
    "q": "london weather",
    "api_key": os.getenv("API_KEY"),
    "hl": "en",
    "gl": "us",
}

search = GoogleSearch(params)
results = search.get_dict()

forecast = results['answer_box']['forecast']
print(json.dumps(forecast, indent=2))
----------
'''
[
  {
    "day": "Tuesday",
    "weather": "Partly cloudy",
    "temperature": {
      "high": "72",
      "low": "57"
    },
    "thumbnail": "https://ssl.gstatic.com/onebox/weather/48/partly_cloudy.png"
  }
  ...
]
'''
Disclaimer, I work for SerpApi.
I'm using Beautiful Soup to find the first hit from a Google search.
Looking for "Stack Overflow", it should find https://www.stackoverflow.com.
The code is mainly taken from here. However, it suddenly stopped working, with results[0] being index out of range.
print results[0] IndexError: list index out of range
I suspect it's a cache problem as it was working fine and then stopped without changing the code. I've also rebooted and cleared the cache but still no results.
#!/usr/bin/python
# -*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import requests
import webbrowser  # for webbrowser, duh!
import re

#------------------------------------------------
def write_it(s, f):
    # w for over write
    file = open(f, "w")
    file.write(s)
    file.close()

#------------------------------------------------
def URL_encode_space(s):
    return re.sub(r"\s", "%20", s)

#------------------------------------------------
def URL_decode_space(s):
    return re.sub(r"%20", " ", s)

#------------------------------------------------
urlBase = "https://google.com"
searchRequest = "Stack Overflow"
print searchRequest

searchRequest = URL_encode_space(searchRequest)

# String literal for HTML quote
q = "%22"  # is a "
numOfResults = 10
myURL = urlBase + "/search?q=" + q + searchRequest + q + "&num={" + str(numOfResults) + "}"

page = requests.get(myURL)
soup = BeautifulSoup(page.text, "html.parser")

links = soup.findAll("a")

results = []
for link in links:
    link_href = link.get('href')
    if "url?q=" in link_href and not "webcache" in link_href:
        print (link.get('href').split("?q=")[1].split("&sa=U")[0])
        results.append(link.get('href').split("?q=")[1].split("&sa=U")[0])

print results[0]

# open web browser?
webbrowser.open(myURL)
I can obviously check the 'len(results)' to remove the error but that doesn't explain why it no longer works.
As people have said above, it isn't clear what could cause the problem.
Make sure you're using a user-agent.
I took this code from my other answer (scraping headings, summaries, and links from Google search results).
Code and full example:
from bs4 import BeautifulSoup
import requests
import json

headers = {
    'User-agent':
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('https://www.google.com/search?q=java&oq=java',
                    headers=headers).text
soup = BeautifulSoup(html, 'lxml')

summary = []

for container in soup.findAll('div', class_='tF2Cxc'):
    heading = container.find('h3', class_='LC20lb DKV0Md').text
    article_summary = container.find('span', class_='aCOpRe').text
    link = container.find('a')['href']

    summary.append({
        'Heading': heading,
        'Article Summary': article_summary,
        'Link': link,
    })

print(json.dumps(summary, indent=2, ensure_ascii=False))
Alternatively, you can use Google Organic Results API from SerpApi to get these results.
It's a paid API with a free trial.
Part of JSON:
{
  "position": 1,
  "title": "Java | Oracle",
  "link": "https://www.java.com/",
  "displayed_link": "https://www.java.com",
  "snippet": "Java Download. » What is Java? » Need Help? » Uninstall. About Java. Go Java Java Training Java + Greenfoot Oracle Code One Oracle Academy for ..."
}
Code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "stackoverflow",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
print(f"Link: {result['link']}")
Output:
Link: https://stackoverflow.com/
Link: https://en.wikipedia.org/wiki/Stack_Overflow
Link: https://stackoverflow.blog/
Link: https://stackoverflow.blog/podcast/
Link: https://www.linkedin.com/company/stack-overflow
Link: https://www.crunchbase.com/organization/stack-overflow
Disclaimer, I work for SerpApi.
I am trying to extract the results of a search from Google News ("vaccine", for example) and provide some sentiment analysis based on the headlines collected.
So far, I can't seem to find the correct tag to collect the headlines.
Here is my code:
from textblob import TextBlob
import requests
from bs4 import BeautifulSoup

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(self.term)

    def run(self):
        response = requests.get(self.url)
        print(response.text)
        soup = BeautifulSoup(response.text, 'html.parser')
        headline_results = soup.find_all('div', class_="phYMDf nDgy9d")
        for h in headline_results:
            blob = TextBlob(h.get_text())
            self.sentiment += blob.sentiment.polarity / len(headline_results)
            self.subjectivity += blob.sentiment.subjectivity / len(headline_results)

a = Analysis('Vaccine')
a.run()
print(a.term, 'Subjectivity: ', a.subjectivity, 'Sentiment: ', a.sentiment)
The results are always 0 for the sentiment and 0 for the subjectivity. I feel like the issue is with class_="phYMDf nDgy9d".
If you open that link in a browser, you see the finished state of the page, but requests.get does not execute or load any more data beyond the page you request. Luckily there is some data, and you can scrape it. I suggest you use an HTML prettifier service like codebeautify to get a better understanding of the page structure.
Also, if you see classes like phYMDf nDgy9d, avoid searching by them. They are minified class names, so as soon as Google changes part of its CSS, the class you are looking for will get a new name.
What I did is probably overkill, but I managed to dig down to scrape the specific parts, and your code works now.
When you look at the prettified version of the requested HTML file, the necessary content is in a div with an id of main. Its children start with a div element containing "Google Search", continue with a style element, and after one empty div element come the post div elements. The last two elements in that children list are footer and script elements. We can cut these off with [3:-2], and under that tree we have (pretty much) pure data. If you check the remaining part of the code after the posts variable, I think you can understand it.
Here is the code:
from textblob import TextBlob
import requests, re
from bs4 import BeautifulSoup
from pprint import pprint

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(self.term)

    def run(self):
        response = requests.get(self.url)
        #print(response.text)
        soup = BeautifulSoup(response.text, 'html.parser')
        mainDiv = soup.find("div", {"id": "main"})
        posts = [i for i in mainDiv.children][3:-2]
        news = []
        for post in posts:
            reg = re.compile(r"^/url.*")
            cursor = post.findAll("a", {"href": reg})
            postData = {}
            postData["headline"] = cursor[0].find("div").get_text()
            postData["source"] = cursor[0].findAll("div")[1].get_text()
            postData["timeAgo"] = cursor[1].next_sibling.find("span").get_text()
            postData["description"] = cursor[1].next_sibling.find("span").parent.get_text().split("· ")[1]
            news.append(postData)
        pprint(news)
        for h in news:
            blob = TextBlob(h["headline"] + " " + h["description"])
            self.sentiment += blob.sentiment.polarity / len(news)
            self.subjectivity += blob.sentiment.subjectivity / len(news)

a = Analysis('Vaccine')
a.run()
print(a.term, 'Subjectivity: ', a.subjectivity, 'Sentiment: ', a.sentiment)
A few outputs:
[{'description': 'It comes after US health officials said last week they had '
'started a trial to evaluate a possible vaccine in Seattle. '
'The Chinese effort began on...',
'headline': 'China embarks on clinical trial for virus vaccine',
'source': 'The Star Online',
'timeAgo': '5 saat önce'},
{'description': 'Hanneke Schuitemaker, who is leading a team working on a '
'Covid-19 vaccine, tells of the latest developments and what '
'needs to be done now.',
'headline': 'Vaccine scientist: ‘Everything is so new in dealing with this '
'coronavirus’',
'source': 'The Guardian',
'timeAgo': '20 saat önce'},
.
.
.
Vaccine Subjectivity: 0.34522727272727277 Sentiment: 0.14404040404040402
[{'description': '10 Cool Tech Gadgets To Survive Working From Home. From '
'Wi-Fi and cell phone signal boosters, to noise-cancelling '
'headphones and gadgets...',
'headline': '10 Cool Tech Gadgets To Survive Working From Home',
'source': 'CRN',
'timeAgo': '2 gün önce'},
{'description': 'Over the past few years, smart home products have dominated '
'the gadget space, with goods ranging from innovative updates '
'to the items we...',
'headline': '6 Smart Home Gadgets That Are Actually Worth Owning',
'source': 'Entrepreneur',
'timeAgo': '2 hafta önce'},
.
.
.
Home Gadgets Subjectivity: 0.48007305194805205 Sentiment: 0.3114683441558441
I used the headline and description data for the operations, but you can play with them if you want. You have the data now :)
Use this:
headline_results = soup.find_all('div', {'class': 'BNeawe vvjwJb AP7Wnd'})
You already printed response.text; if you want to find specific data, search within the response.text result.
Try to use select() instead; CSS selectors are more flexible (see the CSS selectors reference).
Have a look at SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser.
If you want to get all titles and so on, then you are looking for this container:
soup.select('.dbsr')
Make sure to pass a user-agent, because Google might eventually block your requests and you'll receive different HTML, and thus empty output. Check what your user-agent is.
Pass user-agent:
headers = {
    "User-agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

requests.get("YOUR_URL", headers=headers)
I'm not sure exactly what you're trying to do, but the solution from Guven Degirmenci is a bit overkill, as he mentioned, with slicing, regex, and digging into div#main. It's much simpler.
Code and example in the online IDE:
from textblob import TextBlob
import requests
from bs4 import BeautifulSoup

headers = {
    "User-agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = f"https://www.google.com/search?q={self.term}&tbm=nws"

    def run(self):
        response = requests.get(self.url, headers=headers)
        soup = BeautifulSoup(response.text, "html.parser")

        news_data = []
        for result in soup.select('.dbsr'):
            title = result.select_one('.nDgy9d').text
            link = result.a['href']
            source = result.select_one('.WF4CUc').text
            snippet = result.select_one('.Y3v8qd').text
            date_published = result.select_one('.WG9SHc span').text

            news_data.append({
                "title": title,
                "link": link,
                "source": source,
                "snippet": snippet,
                "date_published": date_published
            })

        for h in news_data:
            blob = TextBlob(f"{h['title']} {h['snippet']}")
            self.sentiment += blob.sentiment.polarity / len(news_data)
            self.subjectivity += blob.sentiment.subjectivity / len(news_data)

a = Analysis("Lasagna")
a.run()

print(a.term, "Subjectivity: ", a.subjectivity, "Sentiment: ", a.sentiment)

# Vaccine Subjectivity: 0.3255952380952381 Sentiment: 0.05113636363636363
# Lasagna Subjectivity: 0.36556818181818185 Sentiment: 0.25386093073593075
Alternatively, you can achieve the same thing by using Google News Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to maintain the parser, figure out how to parse certain elements, or work out why something isn't working as it should, and you don't need to understand how to bypass blocks from Google. All that needs to be done is to iterate over structured JSON and get what you want, fast.
Code integrated with your example:
from textblob import TextBlob
import os
from serpapi import GoogleSearch

class Analysis:
    def __init__(self, term):
        self.term = term
        self.subjectivity = 0
        self.sentiment = 0
        self.url = "https://www.google.com/search"

    def run(self):
        params = {
            "engine": "google",
            "tbm": "nws",
            "q": self.term,
            "api_key": os.getenv("API_KEY"),
        }

        search = GoogleSearch(params)
        results = search.get_dict()

        news_data = []
        for result in results['news_results']:
            title = result['title']
            link = result['link']
            snippet = result['snippet']
            source = result['source']
            date_published = result['date']

            news_data.append({
                "title": title,
                "link": link,
                "source": source,
                "snippet": snippet,
                "date_published": date_published
            })

        for h in news_data:
            blob = TextBlob(f"{h['title']} {h['snippet']}")
            self.sentiment += blob.sentiment.polarity / len(news_data)
            self.subjectivity += blob.sentiment.subjectivity / len(news_data)

a = Analysis("Vaccine")
a.run()

print(a.term, "Subjectivity: ", a.subjectivity, "Sentiment: ", a.sentiment)

# Vaccine Subjectivity: 0.30957251082251086 Sentiment: 0.06277056277056277
# Lasagna Subjectivity: 0.30957251082251086 Sentiment: 0.06277056277056277
P.S. I wrote a more detailed blog post about how to scrape Google News.
Disclaimer, I work for SerpApi.
I'm learning how to use BeautifulSoup and I'm trying to read the weather from Google. I'm using this URL.
I'm getting a 'KeyError: "id"' error on the line:
if span.attrs["id"] == "wob_tm":
What does this mean and how can I solve this problem?
I got the same error specifying a different attribute, "class", so I thought it might have just been a problem with the term "class", but I'm still receiving the error no matter what I use.
# Creates a list containing all appearances of the 'span' tag.
# The weather value is located within a span tag.
spans = soup.find_all("span")
for span in spans:
    if span.attrs["id"] == "wob_tm":
        print(span.content)
I expect the output to be the integer value of the weather but when I run the code I just get:
"KeyError: 'id'"
Some span tags don't have that attribute at all, so they give you the error when you try to access it. You could just refine your search:
spans = soup.find_all('span', {'id': 'wob_tm'})
This would find only objects that match. You can then just print them all:
for span in spans:
    print(span.content)
Although the rest of the answers are legit, none will work in this case, because the temperature content is probably loaded using JavaScript, so the spans you're looking for won't be found. Instead you can use Selenium, which works for sure, i.e.:
from selenium import webdriver
driver = webdriver.Chrome('chromedriver.exe')
driver.get("https://www.google.co.uk/search?sxsrf=ACYBGNSfZJRq-EqvQ7rSC0oFZW-FiL-S-Q%3A1571602469929&source=hp&ei=JcCsXb-ANoK4kwWgtK_4DQ&q=what%27s+the+weather+today&oq=whats+the+weather+&gs_l=psy-ab.3.0.0i10i70i256j0i10j0j0i10l3j0l3j0i10.663.2962..4144...0.0..0.82.1251.19......0....1..gws-wiz.....10..35i362i39j35i39j0i131.AWESAgn5njA")
temp = driver.find_element_by_id('wob_tm').text
print(temp)
The problem is that there is no 'id' key in the 'attrs' dictionary. The code below will handle this case.
spans = soup.find_all("span")
for span in spans:
if span.attrs.get("id") == "wob_tm":
print(span.content)
else:
print('not wob_tm')
Contrary to what Kostas Charitidis mentioned, the weather data is not rendered with JavaScript.
You don't need to specify the <span> element, and moreover you don't need to use find_all()/findAll()/select(), since you're looking for just one element that doesn't repeat anywhere else. Use select_one() instead:
soup.select_one('#wob_tm').text
# prints temperature
You can also use try/except if you want to return None:
try:
    temperature = soup.select_one('#wob_tm').text
except:
    temperature = None
An if statement always costs you something, while it's nearly free to set up a try/except block. But when an exception actually occurs, the cost is much higher. Both styles are sketched below.
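As a small illustration of that trade-off (both snippets assume the soup object from the surrounding example and look up the same #wob_tm element):
element = soup.select_one('#wob_tm')

# "Look before you leap": pay for the check on every call.
temperature = element.text if element is not None else None

# "Easier to ask forgiveness": free when the element exists,
# costly only in the rare case it is missing.
try:
    temperature = soup.select_one('#wob_tm').text
except AttributeError:
    temperature = None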
The next problem that might cause that error is that no user-agent is specified, so Google will eventually block your request and you'll receive completely different HTML. I already wrote an answer about what a user-agent is.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml

headers = {
    "User-Agent":
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    "q": "london weather",
    "hl": "en",
    "gl": "us"
}

response = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(response.text, 'lxml')

weather_condition = soup.select_one('#wob_dc').text
temperature = soup.select_one('#wob_tm').text
precipitation = soup.select_one('#wob_pp').text
humidity = soup.select_one('#wob_hm').text
wind = soup.select_one('#wob_ws').text
current_time = soup.select_one('#wob_dts').text

print(f'Weather condition: {weather_condition}\n'
      f'Temperature: {temperature}°F\n'
      f'Precipitation: {precipitation}\n'
      f'Humidity: {humidity}\n'
      f'Wind speed: {wind}\n'
      f'Current time: {current_time}\n')
----
'''
Weather condition: Mostly cloudy
Temperature: 60°F
Precipitation: 3%
Humidity: 77%
Wind speed: 3 mph
Current time: Friday 7:00 AM
'''
Alternatively, you can achieve this by using the Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to figure out how to extract elements, since that's already done for the end user, and there's no need to maintain a parser over time. All that needs to be done is to iterate over structured JSON and get what you were looking for.
Code to integrate:
from serpapi import GoogleSearch
import os

params = {
    "engine": "google",
    "q": "london weather",
    "api_key": os.getenv("API_KEY"),
    "hl": "en",
    "gl": "us",
}

search = GoogleSearch(params)
results = search.get_dict()

loc = results['answer_box']['location']
weather_date = results['answer_box']['date']
weather = results['answer_box']['weather']
temp = results['answer_box']['temperature']
precipitation = results['answer_box']['precipitation']
humidity = results['answer_box']['humidity']
wind = results['answer_box']['wind']

print(f'{loc}\n{weather_date}\n{weather}\n{temp}°F\n{precipitation}\n{humidity}\n{wind}\n')
-------
'''
District 3
Friday
Mostly sunny
80°F
0%
52%
5 mph
'''
Disclaimer, I work for SerpApi.