How to fix 'KeyError' error in BeautifulSoup - python

I'm learning how to use BeautifulSoup and I'm trying to read the weather from Google. I'm using this URL.
I'm getting a 'KeyError: "id"' error on the line:
if span.attrs["id"] == "wob_tm":
What does this mean and how can I solve this problem?
I got the same error when specifying a different attribute, "class", so I thought it might just have been a problem with the term "class", but I'm still receiving the error no matter what attribute I use.
# Creates a list containing all appearences of the 'span' tag
# The weather value is located within a span tag
spans = soup.find_all("span")
for span in spans:
if span.attrs["id"] == "wob_tm":
print(span.content)
I expect the output to be the integer value of the weather but when I run the code I just get:
"KeyError: 'id'"

Some span tags don't have that attribute at all, so you get the error when you try to access it. You could just refine your search:
spans = soup.find_all('span', {'id': 'wob_tm'})
This would find only objects that match. You can then just print them all:
for span in spans:
print(span.text)
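Equivalently, find_all() accepts attribute filters as keyword arguments (a minor variation on the same idea), and since at most one element carries that id you can use find(), which returns None when nothing matches:
spans = soup.find_all('span', id='wob_tm')
# or:
span = soup.find('span', id='wob_tm')
if span:
    print(span.text)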

Although the rest of the answers are legitimate, none of them will work in this case because the temperature content is probably loaded using JavaScript, so the spans you're looking for won't be found. Instead, you can use Selenium, which works for sure, e.g.:
from selenium import webdriver
driver = webdriver.Chrome('chromedriver.exe')
driver.get("https://www.google.co.uk/search?sxsrf=ACYBGNSfZJRq-EqvQ7rSC0oFZW-FiL-S-Q%3A1571602469929&source=hp&ei=JcCsXb-ANoK4kwWgtK_4DQ&q=what%27s+the+weather+today&oq=whats+the+weather+&gs_l=psy-ab.3.0.0i10i70i256j0i10j0j0i10l3j0l3j0i10.663.2962..4144...0.0..0.82.1251.19......0....1..gws-wiz.....10..35i362i39j35i39j0i131.AWESAgn5njA")
temp = driver.find_element_by_id('wob_tm').text
print(temp)
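If the value is injected after the initial page load, grabbing the element immediately can still fail. An explicit wait is more robust; a small sketch, assuming the same driver and URL as above:
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait up to 10 seconds for the element with id="wob_tm" to appear
temp_element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.ID, "wob_tm"))
)
print(temp_element.text)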

The problem is that there is no 'id' key in the 'attrs' dictionary. The code below handles this case.
spans = soup.find_all("span")
for span in spans:
if span.attrs.get("id") == "wob_tm":
print(span.text)
else:
print('not wob_tm')
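As a shorthand, a bs4 Tag supports .get() directly (it behaves like attrs.get()), so the lookup can be written without .attrs; a small sketch:
for span in soup.find_all("span"):
    if span.get("id") == "wob_tm":  # returns None if the attribute is missing
        print(span.text)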

Weather data is not rendered with JavaScript, contrary to what Kostas Charitidis mentioned.
You don't need to specify the <span> element, and moreover you don't need to use find_all()/findAll()/select(), since you're looking for just one element that doesn't repeat anywhere else. Use select_one() instead:
soup.select_one('#wob_tm').text
# prints temperature
You can also use try/except if you want to return None:
try:
temperature = soup.select_one('#wob_tm').text
except:
    temperature = None
Setting up a try/except block is nearly free, but when an exception actually occurs the cost is much higher; an if check, on the other hand, always costs a small amount.
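For comparison, the equivalent if-style check relies on the fact that select_one() returns None when nothing matches:
temp_element = soup.select_one('#wob_tm')
temperature = temp_element.text if temp_element else None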
The next problem that might cause that error is that no user-agent is specified, so Google will eventually block your request and you'll receive completely different HTML. I've already answered a question about what a user-agent is.
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "london weather",
"hl": "en",
"gl": "us"
}
response = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(response.text, 'lxml')
weather_condition = soup.select_one('#wob_dc').text
temperature = soup.select_one('#wob_tm').text
precipitation = soup.select_one('#wob_pp').text
humidity = soup.select_one('#wob_hm').text
wind = soup.select_one('#wob_ws').text
current_time = soup.select_one('#wob_dts').text
print(f'Weather condition: {weather_condition}\n'
f'Temperature: {temperature}°F\n'
f'Precipitation: {precipitation}\n'
f'Humidity: {humidity}\n'
f'Wind speed: {wind}\n'
f'Current time: {current_time}\n')
----
'''
Weather condition: Mostly cloudy
Temperature: 60°F
Precipitation: 3%
Humidity: 77%
Wind speed: 3 mph
Current time: Friday 7:00 AM
'''
Alternatively, you can achieve this by using the Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to figure out how to extract elements, since that's already done for the end user, and there's no need to maintain a parser over time. All that needs to be done is to iterate over the structured JSON and get what you're looking for.
Code to integrate:
from serpapi import GoogleSearch
import os
params = {
"engine": "google",
"q": "london weather",
"api_key": os.getenv("API_KEY"),
"hl": "en",
"gl": "us",
}
search = GoogleSearch(params)
results = search.get_dict()
loc = results['answer_box']['location']
weather_date = results['answer_box']['date']
weather = results['answer_box']['weather']
temp = results['answer_box']['temperature']
precipitation = results['answer_box']['precipitation']
humidity = results['answer_box']['humidity']
wind = results['answer_box']['wind']
print(f'{loc}\n{weather_date}\n{weather}\n{temp}°F\n{precipitation}\n{humidity}\n{wind}\n')
-------
'''
District 3
Friday
Mostly sunny
80°F
0%
52%
5 mph
'''
Disclaimer, I work for SerpApi.

Related

How to get all page results - Web Scraping - Pagination

I am a beginner in regards to coding. Right now I am trying to get a grip on simple web scrapers using python.
I want to scrape a real estate website and get the Title, price, sqm, and what not into a CSV file.
My questions:
It seems to work for the first page of results, but then it repeats: instead of running through the 40 pages, it fills the file with the same results.
The listings have info about "square meter" and the "number of rooms". When I inspect the page, it seems that it uses the same class for both elements. How would I extract the number of rooms, for example?
Here is the code that I have gathered so far:
import requests
from bs4 import BeautifulSoup
import pandas as pd
def extract(page):
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'}
url = f'https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page={1}'
r = requests.get(url, headers)
soup = BeautifulSoup(r.content, 'html.parser')
return soup
def transform(soup):
divs = soup.find_all('div', class_ = 'col-xs-12 place-over-understitial sel-bg-gray-lighter')
for item in divs:
title = item.find('div', {'class': 'text-225'}).text.strip().replace('\n', '')
title2 = title.replace('\t', '')
hausart = item.find('span', class_ = 'text-100').text.strip().replace('\n', '')
hausart2 = hausart.replace('\t', '')
try:
price = item.find('span', class_ = 'text-250 text-strong text-nowrap').text.strip()
except:
price = 'Auf Anfrage'
wohnflaeche = item.find('p', class_ = 'text-250 text-strong text-nowrap').text.strip().replace('m²', '')
angebot = {
'title': title2,
'hausart': hausart2,
'price': price
}
hauslist.append(angebot)
return
hauslist=[]
for i in range(0, 40):
print(f'Getting page {i}...')
c = extract(i)
transform(c)
df = pd.DataFrame(hauslist)
print(df.head())
df.to_csv('immonetHamburg.csv')
This is my first post on stackoverflow so please be kind if I should have posted my problem differently.
Thanks
Pat
You have a simple mistake.
In the URL you have to use {page} instead of {1}. That's all.
url = f'https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page={page}'
I see another problem:
You start scraping at page 0, but servers often return the same results for page 0 and page 1.
You should use range(1, ...) instead of range(0, ...).
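Putting both fixes together, a corrected extract() and loop might look like this (note that headers should also be passed as the headers= keyword argument; passed positionally, requests treats it as query parameters):
def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.63 Safari/537.36'}
    url = f'https://www.immonet.de/immobiliensuche/sel.do?suchart=2&city=109447&marketingtype=1&pageoffset=1&radius=0&parentcat=2&sortby=0&listsize=26&objecttype=1&page={page}'
    r = requests.get(url, headers=headers)  # headers as a keyword argument
    return BeautifulSoup(r.content, 'html.parser')

for i in range(1, 41):  # pages 1 to 40
    print(f'Getting page {i}...')
    transform(extract(i))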
As for searching for elements:
BeautifulSoup can search not only by class but also by id and any other attribute in a tag, e.g. name, style, data, etc. It can also search by text such as "number of rooms", and it can use a regex for this. You can also pass your own function, which checks an element and returns True/False to decide whether to keep it in the results.
You can also combine .find() with another .find() or .find_all():
price = item.find('div', {"id": lambda value:value and value.startswith('selPrice')}).find('span')
if price:
print("price:", price.text)
And if you know that "square meter" comes before "number of rooms", you could use find_all() to get both of them and then use [0] to get the first and [1] to get the second.
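For illustration, a rough sketch of these lookups; the selPrice id pattern and the text-250 class are taken from the snippets in this thread and are assumptions about the page markup:
import re

# search by a regex on an attribute value
price_div = item.find('div', {'id': re.compile(r'^selPrice')})

# search with a custom function that returns True/False for each tag
rooms_p = item.find(lambda tag: tag.name == 'p' and 'text-250' in tag.get('class', []))

# "square meter" first, "number of rooms" second (assuming that order on the page)
values = item.find_all('p', class_='text-250')
if len(values) >= 2:
    sqm, rooms = values[0].text, values[1].text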
You should read all the documentation because it can be very useful.
I advise you to use Selenium instead, because you can physically click the 'next page' button until you cover all the pages, and the whole code will only take a few lines.
As @furas mentioned, you have a mistake with the page parameter.
To get the rooms you need find_all() and then take the last element with index -1, because sometimes there are 2 items and sometimes 3.
# to remove all \n and \t
translator = str.maketrans({chr(10): '', chr(9): ''})
rooms = item.find_all('p', {'class': 'text-250'})
if rooms:
rooms = rooms[-1].text.translate(translator).strip()
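A slightly simpler alternative (a sketch) is to let str.split() collapse all whitespace, including tabs and newlines, instead of using a translation table:
rooms = item.find_all('p', {'class': 'text-250'})
if rooms:
    rooms = ' '.join(rooms[-1].text.split())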

Trouble scraping weather data from Google

I'm writing a program that will scrape wind speed and direction data from Google. I've seen other results online where it works out fine, but for some reason, it's not working out for me. I am specifically interested in scraping the elements with "img" tags. Here is my code:
import requests
import bs4
import geocoder
lat, long = 40.776903698619975, -74.45007646247723
base_url = r"https://www.google.com/search?q="
geoc = geocoder.osm([lat, long], method='reverse').json["raw"]["address"]
search_query = geoc["state"] + " " + geoc["country"] + " wind conditions"
lowest_admin_levels = ("municipality", "town", "city", "county")
level_found = False
for level in lowest_admin_levels:
try:
search_query = geoc[level] + " " + search_query
level_found = True
break
except KeyError:
continue
url = base_url + search_query.replace(" ", "+")
print(url)
page = requests.get(url)
soup = bs4.BeautifulSoup(page.content, 'html.parser')
print(soup.find_all('img'))
The lat/long variables could be any coordinates, those are just examples. soup.find_all('img') returns just one "img" element, when in reality, the page has multiple "img"s containing arrows rotated according to the wind direction, which you can see in this link https://www.google.com/search?q=Morris+Township+New+Jersey+United+States+wind+conditions. Thank you!
As the comment already says, Google loads the images dynamically using JavaScript. The requests library and Beautiful Soup cannot get those JavaScript-loaded images. That's why you need Selenium to get them.
Installation
pip install selenium
pip install webdriver-manager
Solution
import geocoder
# New imports
import time
from selenium import webdriver
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.chrome.options import Options
lat, long = 40.776903698619975, -74.45007646247723
BASE_URL = r"https://www.google.com/search?q="
geoc = geocoder.osm([lat, long], method='reverse').json["raw"]["address"]
search_query = geoc["state"] + " " + geoc["country"] + " wind conditions"
lowest_admin_levels = ("municipality", "town", "city", "county")
for level in lowest_admin_levels:
try:
search_query = geoc[level] + " " + search_query
break
except KeyError:
continue
url = BASE_URL + search_query.replace(" ", "+")
chrome_options = Options()
# The options make the browser headless, so you don't see it
# comment out those two lines to see whats happening
chrome_options.add_argument("--headless")
chrome_options.add_argument("--window-size=1920x1080")
driver = webdriver.Chrome(ChromeDriverManager().install(), options=chrome_options) # You could specify the path to the chrome driver instead
driver.get(url)
time.sleep(2)
imgs = driver.find_elements_by_tag_name('img') # all the image tags
for img in imgs:
image_source = img.get_attribute('src') # The src of the img tag
print(image_source)
When you remove the headless option, you will see what Selenium "sees". Using Selenium, you can also click around on the website and interact with it, just as a normal user would.
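To narrow the results down to the wind arrows specifically, you could filter on the src or alt attributes; a sketch based on the markup shown in the next answer, where the arrow images use wind_unselected.svg and carry the speed and direction in alt:
wind_imgs = [img for img in imgs if 'wind' in (img.get_attribute('src') or '')]
for img in wind_imgs:
    print(img.get_attribute('alt'))  # e.g. "8 km/h From northwest"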
This doesn't require geocoder or Selenium. Check out the SelectorGadget Chrome extension to visually grab CSS selectors by clicking on the desired element.
Also, you can get the wind direction from the same element, e.g. class='wob_t' -> aria-label:
<span class="wob_t" style="display:inline;text-align:right" aria-label="8 km/h From northwest Tuesday 10:00">8 km/h</span>
Which is the same as in the <img> element (look at alt):
<img src="//ssl.gstatic.com/m/images/weather/wind_unselected.svg" alt="8 km/h From northwest" style="transform-origin:50% 50%;transform:rotate(408deg);width:16px" aria-hidden="true" data-atf="1" data-frt="0" class="">
Code and full example that scrapes more in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
params = {
"q": "london weather",
"hl": "en",
"gl": "us"
}
response = requests.get('https://www.google.com/search', headers=headers, params=params).text
soup = BeautifulSoup(response, 'lxml')
for weather_result in soup.select('.wob_noe .wob_hw'):
try:
wind_speed = weather_result.select_one('.wob_t').text
'''
extracts the aria-label string, splits it by spaces, takes the slice [2:4]
(the 3rd and 4th words), and joins them back with a space. Or just use a regex instead.
Example:
7 mph From northwest Sunday 9:00 AM ---> From northwest
'''
wind_direction = ' '.join(weather_result.select_one('.wob_t')['aria-label'].split(' ')[2:4])
print(f"{wind_speed}\n{wind_direction}\n")
except:
pass # or None instead
----------
'''
8 mph
From northeast
11 mph
From east
9 mph
From northeast
...
'''
Alternatively, you can use Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
Essentially, you don't have to figure out the extraction part of the process; all that really needs to be done is to iterate over the structured JSON and use whatever you need from it, without having to bypass blocks from Google or maintain the parser over time.
Code to integrate:
from serpapi import GoogleSearch
import os, json
params = {
"engine": "google",
"q": "london weather",
"api_key": os.getenv("API_KEY"),
"hl": "en",
"gl": "us",
}
search = GoogleSearch(params)
results = search.get_dict()
forecast = results['answer_box']['forecast']
print(json.dumps(forecast, indent=2))
----------
'''
[
{
"day": "Tuesday",
"weather": "Partly cloudy",
"temperature": {
"high": "72",
"low": "57"
},
"thumbnail": "https://ssl.gstatic.com/onebox/weather/48/partly_cloudy.png"
}
...
]
'''
Disclaimer, I work for SerpApi.

Facing issue at the time of Web scraping

I am trying to extract reviews from Glassdoor. However, I am facing issues. Please see my code below:
import requests
from bs4 import BeautifulSoup
headers = requests.utils.default_headers()
headers.update({
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
url = requests.get("https://www.glassdoor.co.in/Reviews/The-Wonderful-Company-Reviews-E1005987.htm?sort.sortType=RD&sort.ascending=false&countryRedirect=true", headers=headers)
urlContent =BeautifulSoup(url.content,"lxml")
print(urlContent)
review = urlContent.find_all('a',class_='reviewLink')
review
title = []
for i in range(0,len(review)):
title.append(review[i].get_text())
title
rating= urlContent.find_all('div',class_='v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small')
score=[]
for i in range(0,len(rating)):
score.append(rating[i].get_text())
rev_pros=urlContent.find_all("span",{"data-test":"pros"})
pros=[]
for i in range(0,len(rev_pros)):
pros.append(rev_pros[i].get_text())
pros
rev_cons=urlContent.find_all("span",{"data-test":"cons"})
cons=[]
for i in range(0,len(rev_cons)):
cons.append(rev_cons[i].get_text())
cons
advse=urlContent.find_all("span",{"data-test":"advice-management"})
advse
advise=[]
for i in range(0,len(advse)):
advise.append(advse[i].get_text())
advise
location=urlContent.find_all('span',class_='authorLocation')
location
job_location=[]
for i in range(0,len(location)):
job_location.append(location[i].get_text())
job_location
import pandas as pd
df=pd.DataFrame()
df['Review Title']=title
df['Overall Score']=score
df['Pros']=pros
df['Cons']=cons
df['Jobs_Location']=job_location
df['Advise to Mgmt']=advise
Here I am facing two challenges:
1. I am unable to extract anything for 'advse' (used for 'Advise to Mgmt').
2. I get an error when I use 'Job Location' as a column in the data frame (ValueError: Length of values does not match length of index). My finding for this error: there were ten rows for the other columns, but 'Job Location' has fewer rows because the location is not disclosed in some reviews.
Can anybody help me with this? Thanks in advance.
A better approach would be to find a <div> that encloses each of the reviews and then extract all the information needed from it before moving to the next. This would make it easier to deal with the case where information is missing in some reviews.
For example:
import requests
from bs4 import BeautifulSoup
import pandas as pd
headers = requests.utils.default_headers()
headers.update({
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
url = requests.get("https://www.glassdoor.co.in/Reviews/The-Wonderful-Company-Reviews-E1005987.htm?sort.sortType=RD&sort.ascending=false&countryRedirect=true", headers=headers)
urlContent = BeautifulSoup(url.content,"lxml")
get_text = lambda x: x.get_text(strip=True) if x else ""
entries = []
for entry in urlContent.find_all('div', class_='row mt'):
review = entry.find('a', class_="reviewLink")
rating = entry.find('div',class_='v2__EIReviewsRatingsStylesV2__ratingNum v2__EIReviewsRatingsStylesV2__small')
rev_pros = entry.find("span", {"data-test":"pros"})
rev_cons = entry.find("span", {"data-test":"cons"})
location = entry.find('span', class_='authorLocation')
advice = entry.find("span", {"data-test":"advice-management"})
entries.append([
get_text(review),
get_text(rating),
get_text(rev_pros),
get_text(rev_cons),
get_text(location),
get_text(advice)
])
columns = ['Review Title', 'Overall Score', 'Pros', 'Cons', 'Jobs_Location', 'Advise to Mgmt']
df = pd.DataFrame(entries, columns=columns)
print(df)
The get_text() function ensures that if nothing was returned (i.e. None) then an empty string is returned.
You will need to improve your logic for extracting the advice. The information for the whole page is held inside a <script> tag. One of them holds the JSON data. The advice information is not moved into HTML until a user clicks on it, as such it would need to be extracted from the JSON. If this approach is used, then it would also make sense to extract all of the other information also directly from the JSON.
To do this, locate all the <script> tags and determine which contains the reviews. Convert the JSON into a Python data structure (using the JSON library). Now locate the reviews, for example:
import requests
from bs4 import BeautifulSoup
import json
import pandas as pd
headers = requests.utils.default_headers()
headers.update({
'User-Agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:52.0) Gecko/20100101 Firefox/52.0',
})
url = requests.get("https://www.glassdoor.co.in/Reviews/The-Wonderful-Company-Reviews-E1005987.htm?sort.sortType=RD&sort.ascending=false&countryRedirect=true", headers=headers)
urlContent = BeautifulSoup(url.content,"lxml")
entries = []
for script in urlContent.find_all('script'):
text = script.text
if "appCache" in text:
# extract the JSON from the script tag
data = json.loads(text[text.find('{'): text.rfind('}') + 1])
# Go through all keys in the dictionary and pick those containing reviews
for key, value in data['apolloState'].items():
if ".reviews." in key and "links" not in key:
location = value['location']
city = location['id'] if location else None
entries.append([
value['summary'],
value['ratingOverall'],
value['pros'],
value['cons'],
city,
value['advice']
])
columns = ['Review Title', 'Overall Score', 'Pros', 'Cons', 'Jobs_Location', 'Advise to Mgmt']
df = pd.DataFrame(entries, columns=columns)
print(df)
This would give you a dataframe as follows:
Review Title Overall Score Pros Cons Jobs_Location Advise to Mgmt
0 Upper management n... 3 Great benefits, lo... Career advancement... City:1146821 Listen to your emp...
1 Sales 2 Good atmosphere lo... Drive was very far... None None
2 As an organization... 2 Free water and goo... Not a lot of diver... None None
3 Great place to grow 4 If your direct man... Owners are heavily... City:1146821 None
4 Great Company 5 Great leadership, ... To grow and move u... City:1146821 None
5 Lots of opportunit... 5 This is a fast pac... There's a sense of... City:1146821 Continue listening...
6 Interesting work i... 3 Working with great... High workload and ... None None
7 Wonderful 5 This company care... The drive, but we ... City:1146577 Continue growing y...
8 Horrendous 1 The pay was fairly... Culture of abuse a... City:1146821 Upper management l...
9 Upper Leadership a... 1 Strong Company, fu... You don't have a f... City:1146577 You get rid of fol...
It would help if you added print(data) to see the whole structure of the data being returned. The only issue with this approach is that a further lookup would be needed to convert the city ID into an actual location. That information is also contained in the JSON.
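As a purely hypothetical sketch of that extra lookup (it assumes apolloState also contains entries keyed by strings like 'City:1146821' that carry a 'name' field; verify this with print(data) first):
# hypothetical: map 'City:...' ids to readable names, if such entries exist
city_names = {
    key: value.get('name')
    for key, value in data['apolloState'].items()
    if key.startswith('City:') and isinstance(value, dict)
}
df['Jobs_Location'] = df['Jobs_Location'].map(lambda c: city_names.get(c) or c)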

Tag of Google news title for beautiful soup

I am trying to extract the result of a search from Google news (vaccine for example) and provide some sentiment analysis based on the headline collected.
So far, I can't seem to find the correct tag to collect the headlines.
Here is my code:
from textblob import TextBlob
import requests
from bs4 import BeautifulSoup
class Analysis:
def __init__(self, term):
self.term = term
self.subjectivity = 0
self.sentiment = 0
self.url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(self.term)
def run (self):
response = requests.get(self.url)
print(response.text)
soup = BeautifulSoup(response.text, 'html.parser')
headline_results = soup.find_all('div', class_="phYMDf nDgy9d")
for h in headline_results:
blob = TextBlob(h.get_text())
self.sentiment += blob.sentiment.polarity / len(headline_results)
self.subjectivity += blob.sentiment.subjectivity / len(headline_results)
a = Analysis('Vaccine')
a.run()
print(a.term, 'Subjectivity: ', a.subjectivity, 'Sentiment: ' , a.sentiment)
The results are always 0 for the sentiment and 0 for the subjectivity. I feel like the issue is with class_="phYMDf nDgy9d".
If you browse to that link in a browser, you see the finished state of the page, but requests.get does not execute JavaScript or load any data beyond the page you request. Luckily there is still some data you can scrape. I suggest using an HTML prettifier service like codebeautify to get a better understanding of the page structure.
Also, if you see classes like phYMDf nDgy9d, avoid searching by them. They are minified class names, so the moment Google changes part of the CSS, the class you are looking for will get a new name.
What I did is probably overkill, but I managed to dig down and scrape the specific parts, and your code works now.
When you look at the prettified version of the requested HTML file, the necessary content is in a div with an id of main. Its children start with a "Google Search" div element, continue with a style element, and after one empty div element come the post div elements. The last two elements in that children list are footer and script elements. We can cut these off with [3:-2], and under that tree we have (pretty much) pure data. If you check the remaining part of the code after the posts variable, I think you can understand it.
Here is the code:
from textblob import TextBlob
import requests, re
from bs4 import BeautifulSoup
from pprint import pprint
class Analysis:
def __init__(self, term):
self.term = term
self.subjectivity = 0
self.sentiment = 0
self.url = 'https://www.google.com/search?q={0}&source=lnms&tbm=nws'.format(self.term)
def run (self):
response = requests.get(self.url)
#print(response.text)
soup = BeautifulSoup(response.text, 'html.parser')
mainDiv = soup.find("div", {"id": "main"})
posts = [i for i in mainDiv.children][3:-2]
news = []
for post in posts:
reg = re.compile(r"^/url.*")
cursor = post.findAll("a", {"href": reg})
postData = {}
postData["headline"] = cursor[0].find("div").get_text()
postData["source"] = cursor[0].findAll("div")[1].get_text()
postData["timeAgo"] = cursor[1].next_sibling.find("span").get_text()
postData["description"] = cursor[1].next_sibling.find("span").parent.get_text().split("· ")[1]
news.append(postData)
pprint(news)
for h in news:
blob = TextBlob(h["headline"] + " "+ h["description"])
self.sentiment += blob.sentiment.polarity / len(news)
self.subjectivity += blob.sentiment.subjectivity / len(news)
a = Analysis('Vaccine')
a.run()
print(a.term, 'Subjectivity: ', a.subjectivity, 'Sentiment: ' , a.sentiment)
A few outputs:
[{'description': 'It comes after US health officials said last week they had '
'started a trial to evaluate a possible vaccine in Seattle. '
'The Chinese effort began on...',
'headline': 'China embarks on clinical trial for virus vaccine',
'source': 'The Star Online',
'timeAgo': '5 saat önce'},
{'description': 'Hanneke Schuitemaker, who is leading a team working on a '
'Covid-19 vaccine, tells of the latest developments and what '
'needs to be done now.',
'headline': 'Vaccine scientist: ‘Everything is so new in dealing with this '
'coronavirus’',
'source': 'The Guardian',
'timeAgo': '20 saat önce'},
.
.
.
Vaccine Subjectivity: 0.34522727272727277 Sentiment: 0.14404040404040402
[{'description': '10 Cool Tech Gadgets To Survive Working From Home. From '
'Wi-Fi and cell phone signal boosters, to noise-cancelling '
'headphones and gadgets...',
'headline': '10 Cool Tech Gadgets To Survive Working From Home',
'source': 'CRN',
'timeAgo': '2 gün önce'},
{'description': 'Over the past few years, smart home products have dominated '
'the gadget space, with goods ranging from innovative updates '
'to the items we...',
'headline': '6 Smart Home Gadgets That Are Actually Worth Owning',
'source': 'Entrepreneur',
'timeAgo': '2 hafta önce'},
.
.
.
Home Gadgets Subjectivity: 0.48007305194805205 Sentiment: 0.3114683441558441
I used headlines and description data to do the operations but you can play with that if you want. You have the data now :)
use this
headline_results = soup.find_all('div', {'class' : 'BNeawe vvjwJb AP7Wnd'})
You already printed response.text; if you want to find the specific data, search within that response.text output.
Try to use select() instead. CSS selectors are more flexible. CSS selectors reference.
Have a look at SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser.
If you want to get all titles and so on, then you are looking for this container:
soup.select('.dbsr')
Make sure to pass a user-agent, because Google might eventually block your requests and you'll receive different HTML, and thus empty output. Check what your user-agent is.
Pass user-agent:
headers = {
"User-agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
requests.get("YOUR_URL", headers=headers)
I'm not sure exactly what you're trying to do, but the solution from Guven Degirmenci is a bit overkill, as he mentioned, with slicing, regex, and digging into div#main. It can be much simpler.
Code and example in the online IDE:
from textblob import TextBlob
import requests
from bs4 import BeautifulSoup
headers = {
"User-agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
class Analysis:
def __init__(self, term):
self.term = term
self.subjectivity = 0
self.sentiment = 0
self.url = f"https://www.google.com/search?q={self.term}&tbm=nws"
def run (self):
response = requests.get(self.url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
news_data = []
for result in soup.select('.dbsr'):
title = result.select_one('.nDgy9d').text
link = result.a['href']
source = result.select_one('.WF4CUc').text
snippet = result.select_one('.Y3v8qd').text
date_published = result.select_one('.WG9SHc span').text
news_data.append({
"title": title,
"link": link,
"source": source,
"snippet": snippet,
"date_published": date_published
})
for h in news_data:
blob = TextBlob(f"{h['title']} {h['snippet']}")
self.sentiment += blob.sentiment.polarity / len(news_data)
self.subjectivity += blob.sentiment.subjectivity / len(news_data)
a = Analysis("Lasagna")
a.run()
print(a.term, "Subjectivity: ", a.subjectivity, "Sentiment: " , a.sentiment)
# Vaccine Subjectivity: 0.3255952380952381 Sentiment: 0.05113636363636363
# Lasagna Subjectivity: 0.36556818181818185 Sentiment: 0.25386093073593075
Alternatively, you can achieve the same thing by using Google News Results API from SerpApi. It's a paid API with a free plan.
The difference in your case is that you don't have to maintain the parser, figure out how to parse certain elements, work out why something isn't working as it should, or understand how to bypass blocks from Google. All that needs to be done is to iterate over the structured JSON and quickly get what you want.
Code integrated with your example:
from textblob import TextBlob
import os
from serpapi import GoogleSearch
class Analysis:
def __init__(self, term):
self.term = term
self.subjectivity = 0
self.sentiment = 0
self.url = f"https://www.google.com/search"
def run (self):
params = {
"engine": "google",
"tbm": "nws",
"q": self.url,
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
news_data = []
for result in results['news_results']:
title = result['title']
link = result['link']
snippet = result['snippet']
source = result['source']
date_published = result['date']
news_data.append({
"title": title,
"link": link,
"source": source,
"snippet": snippet,
"date_published": date_published
})
for h in news_data:
blob = TextBlob(f"{h['title']} {h['snippet']}")
self.sentiment += blob.sentiment.polarity / len(news_data)
self.subjectivity += blob.sentiment.subjectivity / len(news_data)
a = Analysis("Vaccine")
a.run()
print(a.term, "Subjectivity: ", a.subjectivity, "Sentiment: " , a.sentiment)
# Vaccine Subjectivity: 0.30957251082251086 Sentiment: 0.06277056277056277
# Lasagna Subjectivity: 0.30957251082251086 Sentiment: 0.06277056277056277
P.S - I wrote a bit more detailed blog post about how to scrape Google News.
Disclaimer, I work for SerpApi.

XPath getting a specific set of elements within a class

I am scraping Google Scholar and have trouble getting the right XPath expression. When I inspect the wanted elements, I get expressions like these:
//*[@id="gs_res_ccl_mid"]/div[2]/div[2]/div[3]/a[3]
//*[@id="gs_res_ccl_mid"]/div[3]/div/div[3]/a[3]
//*[@id="gs_res_ccl_mid"]/div[6]/div[2]/div[3]/a[3]
I ended up with the generic expression:
//*[@id="gs_res_ccl_mid"]//a[3]
I also tried this alternative, with similar results:
//*[@id="gs_res_ccl_mid"]/div*/div*/div*/a[3]
The output is something like this (I cannot post the entire result set because I don't have 10 points of reputation):
[
'https://scholar.google.es/scholar?cites=5812018205123467454&as_sdt=2005&sciodt=0,5&hl=es',
'https://scholar.google.es/citations?user=EOc3O8AAAAAJ&hl=es&oi=sra',
'https://scholar.google.es/citations?user=nd8O1XQAAAAJ&hl=es&oi=sra',
'https://scholar.google.es/scholar?cites=15483392402856138853&as_sdt=2005&sciodt=0,5&hl=es',
'https://scholar.google.es/scholar?cites=7733120668292842687&as_sdt=2005&sciodt=0,5&hl=es',
'https://scholar.google.es/scholar?cites=15761030700327980189&as_sdt=2005&sciodt=0,5&hl=es'
]
The problem with the output is that there are 3 extra, unwanted elements, and they all contain the text citations?user. What can I do to get rid of the unwanted elements?
My code:
def paperOthers(exp,atr=None):
thread = browser.find_elements(By.XPATH,(" %s" % exp))
xArray = []
for t in thread:
if atr == 0:
xThread = t.get_attribute('id')
elif atr == 1:
xThread = t.get_attribute('href')
else:
xThread = t.text
xArray.append(xThread)
return xArray
Which I call with:
rcites = paperOthers("//*[@id='gs_res_ccl_mid']//a[3]", 1)
Change the XPath to exclude the items with text.
rcites = paperOthers("//*[@id='gs_res_ccl_mid']//a[3][not(contains(.,'citations?user'))]", 1)
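Equivalently, since all of the links you do want contain scholar?cites= (see the output above), you could select just those; a sketch under that assumption:
rcites = paperOthers("//*[@id='gs_res_ccl_mid']//a[contains(@href, 'scholar?cites=')]", 1)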
The XPath expression could be as simple as //*[@class="gs_fl"]/a[3]/@href:
//* selects all elements in the document.
[@class="gs_fl"] selects element nodes with a gs_fl class attribute.
/a[3] selects the third <a> element that is a child of the gs_fl element.
/@href selects the href attribute of that <a> element.
A w3schools XPath syntax reminder.
Code and full example in the online IDE:
from parsel import Selector
import requests
# https://docs.python-requests.org/en/master/user/quickstart/#passing-parameters-in-urls
params = {
"q": "biology", # search query
"hl": "en" # language
}
# https://docs.python-requests.org/en/master/user/quickstart/#custom-headers
# https://www.whatismybrowser.com/detect/what-is-my-user-agent
# used to act as a "real" user visit
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.87 Safari/537.36",
}
html = requests.get("https://scholar.google.com/scholar", params=params, headers=headers, timeout=30)
selector = Selector(html.text)
for cite_by in selector.xpath('//*[@class="gs_fl"]/a[3]/@href'):
cited_by_link = f"https://scholar.google.com/{cite_by.get()}"
print(cited_by_link)
# output:
"""
https://scholar.google.com//scholar?cites=775353062728716840&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=1275980731835430123&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=9861875288567469852&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=6048612362870884073&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=9716378516521733998&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=12429039222112550214&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=12009957625147018103&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=11605101213592406305&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=85936656034523965&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com//scholar?cites=3694569986105898338&as_sdt=2005&sciodt=0,5&hl=en
"""
Alternatively, you can achieve the same thing by using Google Scholar Organic Results API from SerpApi.
It's a paid API with a free plan that you can use without having to figure out how to scrape the data and maintain the scraper over time, how to scale without getting blocked by the search engine, or how to find reliable proxy providers or CAPTCHA-solving services.
Example code to integrate:
from serpapi import GoogleScholarSearch
import os
params = {
"api_key": os.getenv("API_KEY"), # SerpApi API key
"engine": "google_scholar", # scraping search engine
"q": "biology", # search query
"hl": "en" # langugage
}
search = GoogleScholarSearch(params)
results = search.get_dict()
for cited_by in results["organic_results"]:
cited_by_link = cited_by["inline_links"]["cited_by"]["link"]
print(cited_by_link)
# output:
"""
https://scholar.google.com/scholar?cites=775353062728716840&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=1275980731835430123&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=9861875288567469852&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=6048612362870884073&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=9716378516521733998&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=12429039222112550214&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=12009957625147018103&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=11605101213592406305&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=85936656034523965&as_sdt=2005&sciodt=0,5&hl=en
https://scholar.google.com/scholar?cites=3694569986105898338&as_sdt=2005&sciodt=0,5&hl=en
"""
Disclaimer, I work for SerpApi.
