Cannot get a CSS class from Google search page - python

I am using BeautifulSoup to parse a Google Translate page, but I get an empty list. I want to make a spellchecker using Google's "Did you mean?".
import requests
from bs4 import BeautifulSoup
import urllib.parse
text = "i an you ate goode maan"
data = urllib.parse.quote_plus(text)
url = 'https://translate.google.com/?source=osdd#view=home&op=translate&sl=auto&tl=en&text='
rq = requests.get(url + data)
soup = BeautifulSoup(rq.content, 'html.parser')
words = soup.select('.tlid-spelling-correction spelling-correction gt-spell-correct-message')
print(words)
The output is just [], but I expected: "i and you are good man" (sorry for such a bad text example)

First, the element you are looking for is loaded using JavaScript. Since BeautifulSoup does not run JS, the target elements never get loaded into the DOM, so the query selector can't find them. Try using Selenium instead of BeautifulSoup.
Second, the CSS selector should be
.tlid-spelling-correction.spelling-correction.gt-spell-correct-message
Notice the . instead of a space in front of every class name.
I have verified it using a JS query selector.
The selector you were using, .tlid-spelling-correction spelling-correction gt-spell-correct-message, was looking for an element with class gt-spell-correct-message inside an element with class spelling-correction, which itself was inside another element with class tlid-spelling-correction.
By removing the spaces and putting a dot in front of every class name, the selector looks for a single element that has all three of the above-mentioned classes.
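
Putting both points together, here is a minimal Selenium sketch (assuming Chrome, and that Google hasn't rotated these class names since; treat the selector as illustrative):
import time
import urllib.parse
from selenium import webdriver

text = "i an you ate goode maan"
url = ('https://translate.google.com/?source=osdd#view=home&op=translate'
       '&sl=auto&tl=en&text=' + urllib.parse.quote_plus(text))

driver = webdriver.Chrome()
driver.get(url)
time.sleep(3)  # crude wait for the JS to render; a WebDriverWait would be more robust

# the corrected selector: dots, no spaces, so all three classes must be on one element
correction = driver.find_element_by_css_selector(
    '.tlid-spelling-correction.spelling-correction.gt-spell-correct-message')
print(correction.text)
driver.quit()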


Web scrape data in performance gauge

I'm trying to scrape data in a widget using Python and the requests-html library.
The value I want is in a gauge with an arrow pointing to one of five possible results.
Each label on the gauge is the same on all pages of the website. The problem I face is that I cannot use a CSS selector on the gauge labels to extract the text; I need to extract the value of the arrow itself, as it will be pointing to a label. The arrow doesn't have a text attribute, so if I use a CSS selector I get None as a response.
Each arrow has a unique class name.
https://www.tradingview.com/symbols/NASDAQ-MDB/
StrongBuy:
<div class="arrow-F-uE7IX8 arrowToStrongBuy-1ydGKDOo arrowStrongBuyShudder-3xsGK8k5">
https://www.tradingview.com/symbols/NYSE-XOM/
Buy:
<div class="arrow-F-uE7IX8 arrowToBuy-1R7d8UMJ arrowBuyShudder-3GMCnG5u">
https://www.tradingview.com/symbols/NASDAQ-IDEX/
StrongSell:
<div class="arrow-F-uE7IX8 arrowToStrongSell-3UWimXJs arrowStrongSellShudder-2UJhm0_C">
What can I do to ensure I get the correct value? I'm not sure how I can check whether the selector contains arrowTo{foo} and store that as a variable.
import pyppdf.patch_pyppeteer
from requests_html import AsyncHTMLSession

asession = AsyncHTMLSession()

async def get_page():
    code = 'NASDAQ-MDB'
    r = await asession.get(f'https://www.tradingview.com/symbols/{code}/')
    await r.html.arender(wait=3)
    return r

results = asession.run(get_page)

for result in results:
    arrow_class_placeholder = "//div[contains(@class,'arrow-F-uE7IX8 arrowToStrongBuy-1ydGKDOo')]//div[1]"
    arrow_class_name = result.html.xpath(arrow_class_placeholder, first=True)

    if arrow_class_name == "//div[contains(@class,'arrow-F-uE7IX8 arrowToStrongBuy-1ydGKDOo')]//div[1]":
        print('StrongBuy')
    else:
        print('not strong buy')
You can use BeautifulSoup4 (bs4), which is a Python library for pulling data out of HTML and XML files, in combination with regular expressions (regex). In this case I used Python's re library for the regex part.
Something like this is what you want (source: the BeautifulSoup docs). For example, soup.find_all(class_=re.compile("itle")) returns every tag whose class contains the substring "itle", such as class="title", anywhere in the parsed HTML document.
For your regex it would look something like "arrowTo*" or even just "arrowTo": soup.find_all(class_=re.compile("arrowTo")).
Your final code should look something like:
import re
from bs4 import BeautifulSoup

# 'result' is the requests-html response from your loop above;
# result.html.html should be the rendered HTML document as a string
soup = BeautifulSoup(result.html.html, 'html.parser')
myArrowToList = soup.find_all(class_=re.compile("arrowTo"))
If you wanted "arrowToStrongBuy" just use that in the regex input to the find_all function.
soup.find_all(class_=re.compile("arrowToStrongBuy"))
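To then work out which direction the arrow points, you could inspect the class list of each match. A self-contained sketch, using one of the arrow divs from the question as stand-in HTML:
import re
from bs4 import BeautifulSoup

html = '<div class="arrow-F-uE7IX8 arrowToStrongBuy-1ydGKDOo arrowStrongBuyShudder-3xsGK8k5"></div>'
soup = BeautifulSoup(html, 'html.parser')

for div in soup.find_all(class_=re.compile("arrowTo")):
    # div["class"] is a list like ['arrow-F-uE7IX8', 'arrowToStrongBuy-1ydGKDOo', ...];
    # grab the arrowTo... entry and strip the hashed suffix after the dash
    label = next(c for c in div["class"] if c.startswith("arrowTo"))
    print(label.split("-")[0])  # -> arrowToStrongBuy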

Getting weather for a country, place bs4

I'm trying to use this website https://www.timeanddate.com/weather/ to scrape weather data using BeautifulSoup4, by opening a URL as:
quote_page=r"https://www.timeanddate.com/weather/%s/%s/ext" %(country, place)
I'm still new to web scraping methods and BS4. I can find the information I need in the page source (for example, take the country as India and the city as Mumbai), linked as: https://www.timeanddate.com/weather/india/mumbai/ext
If you view the page's source, it is not difficult to use CTRL+F to find the attributes of information like "Humidity", "Dew Point" and the current state of the weather (clear, rainy, etc.). The only thing preventing me from getting that data is my knowledge of BS4.
Can you inspect the page source and write the BS4 methods to get information like
"Feels Like:", "Visibility", "Dew Point", "Humidity", "Wind" and "Forecast"?
Note: I've done a data scraping exercise before where I had to get the value in an HTML tag like <tag class="someclass">value</tag>
using
a = BeautifulSoup.find(tag, attrs={'class': 'someclass'})
a = a.text.strip()
You could familiarize yourself with CSS selectors:
import requests
from bs4 import BeautifulSoup as bs

country = 'india'
place = 'mumbai'
headers = {'User-Agent': 'Mozilla/5.0',
           'Host': 'www.timeanddate.com'}
quote_page = 'https://www.timeanddate.com/weather/{0}/{1}'.format(country, place)
res = requests.get(quote_page, headers=headers)
soup = bs(res.content, 'lxml')
firstItem = soup.select_one('#qlook p:nth-of-type(2)')
strings = [string for string in firstItem.stripped_strings]
feelsLike = strings[0]
print(feelsLike)
quickFacts = [item.text for item in soup.select('#qfacts p')]

for fact in quickFacts:
    print(fact)
The first selector, #qlook p:nth-of-type(2), uses an id selector to specify the parent, then an :nth-of-type CSS pseudo-class to select the second paragraph-type element (p tag) within. That selector matches the summary block containing the "Feels Like" line.
I use stripped_strings to separate out the individual lines and access the required info by index.
The second selector, #qfacts p, uses an id selector for the parent element, then a descendant combinator with a p type selector to specify child p tag elements. That combination matches the quick-fact lines ("Humidity", "Dew Point", "Visibility", and so on).
quickFacts is the list of those matches; you can access items by index.
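If you'd rather have the quick facts as key/value pairs, one possible follow-up to the code above (assuming each line looks like "Humidity: 83%", which I haven't re-verified against the live page):
# split each "Label: value" line from quickFacts into a dict entry
facts = {k.strip(): v.strip()
         for k, v in (fact.split(':', 1) for fact in quickFacts if ':' in fact)}
print(facts.get('Humidity'))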

Python scrape table

I'm new to programming, so it's quite likely that my approach here is not the right one at all.
I'm trying to scrape the standings table from this site - http://www.flashscore.com/hockey/finland/liiga/ - and for now it would be fine if I could even scrape one column with team names, so I try to find td tags with the class "participant_name col_participant_name col_name". But the code returns empty brackets:
import requests
from bs4 import BeautifulSoup
import lxml

def table(url):
    teams = []
    source = requests.get(url).content
    soup = BeautifulSoup(source, "lxml")
    for td in soup.find_all("td"):
        team = td.find_all("participant_name col_participant_name col_name")
        teams.append(team)
    print(teams)

table("http://www.flashscore.com/hockey/finland/liiga/")
I tried using tr tag to retrieve whole rows, but no success either.
I think the main problem here is that you are trying to scrape dynamically generated content using requests. Note that there's no participant_name col_participant_name col_name text at all in the HTML source of the page, which means it is being generated with JavaScript by the website. For that job you should use something like selenium together with ChromeDriver, or whichever driver you prefer. Below is an example using both of the mentioned tools:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "http://www.flashscore.com/hockey/finland/liiga/"
driver = webdriver.Chrome()
driver.get(url)
source = driver.page_source
soup = BeautifulSoup(source, "lxml")
elements = soup.findAll('td', {'class':"participant_name col_participant_name col_name"})
I think another issue with your code is the way you were trying to access the tags. If you want to match a specific class or any other specific attribute, you can do so using a Python dictionary as an argument to the .findAll function.
Now we can use elements to find all the teams' names. Try print(elements[0]) and notice that the team's name is inside an a tag; we can access it using .a.text. So something like this:
teams = []

for item in elements:
    team = item.a.text
    print(team)
    teams.append(team)

print(teams)
teams now should be the desired output:
>>> teams
['Assat', 'Hameenlinna', 'IFK Helsinki', 'Ilves', 'Jyvaskyla', 'KalPa', 'Lukko', 'Pelicans', 'SaiPa', 'Tappara', 'TPS Turku', 'Karpat', 'KooKoo', 'Vaasan Sport', 'Jukurit']
teams could also be created using list comprehension:
teams = [item.a.text for item in elements]
Mr Aguiar beat me to it! I will just point out that you can do it all with selenium alone. Of course he is correct in pointing out that this is one of the many sites that loads most of its content dynamically.
You might be interested in observing that I have used an xpath expression. These often make for compact ways of saying what you want. Not too hard to read once you get used to them.
>>> from selenium import webdriver
>>> driver = webdriver.Chrome()
>>> driver.get('http://www.flashscore.com/hockey/finland/liiga/')
>>> items = driver.find_elements_by_xpath('.//span[@class="team_name_span"]/a[text()]')
>>> for item in items:
... item.text
...
'Assat'
'Hameenlinna'
'IFK Helsinki'
'Ilves'
'Jyvaskyla'
'KalPa'
'Lukko'
'Pelicans'
'SaiPa'
'Tappara'
'TPS Turku'
'Karpat'
'KooKoo'
'Vaasan Sport'
'Jukurit'
You're very close.
Start out being a little less ambitious, and just focus on "participant_name". Take a look at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all . I think you want something like:
for td in soup.find_all("td", "participant_name"):
Also, you must be seeing different web content than I am: after a wget of your URL, grep doesn't find "participant_name" in the text at all. You'll want to verify that your code is looking for an ID or a class that is actually present in the HTML text.
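The same check from Python, if you don't have wget and grep handy:
import requests

# fetch the raw, un-rendered HTML and look for the class the question targets
html = requests.get("http://www.flashscore.com/hockey/finland/liiga/").text
print("participant_name" in html)  # False here: the table is injected by JavaScript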
Achieving the same thing using a CSS selector, which makes the code more readable and concise:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.flashscore.com/hockey/finland/liiga/')

for player_name in driver.find_elements_by_css_selector('.participant_name'):
    print(player_name.text)

driver.quit()

Web Scraping with Python - Looping for city name, clicking and get interested value

This is my first time with Python and web scraping. I have been looking around and am still unable to do what I need.
Below are screenshots of the elements I inspected via Chrome.
As you can see, they are from the dropdown 'Apartments'.
My 1st step is to get the list of cities from the dropdown.
My 2nd step is, from the given city list, to go to each city's page (...url.../Brantford/ for example).
My 3rd step is, given the available apartments, to click each of them to get the price range for each bedroom type.
Currently, I am JUST trying to loop through the cities in the first step, and it's not working.
Could you also point me to a good forum, article, or tutorial for a beginner like me to read and learn from? I'd really like to get good at this so that I may give back to society one day.
Thank you!
import requests
from bs4 import BeautifulSoup
url = 'http://www.homestead.ca/apartments-for-rent/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html,'lxml')
dropdown_list = soup.find(".child-pages dropdown-menu a href")
print (dropdown_list.prettify())
You can access the elements via the class and a child a node, then read the href attribute and prepend the domain name.
import requests
from bs4 import BeautifulSoup

url = 'http://www.homestead.ca/apartments-for-rent/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')

dropdown_list = soup.select(".primary .child-pages a")
links = ['http://www.homestead.ca' + x['href'] for x in dropdown_list]
print(links)

city_names = [x.text for x in dropdown_list]
print(city_names)

result = []

for link in links:
    response = requests.get(link)
    html = response.content
    soup = BeautifulSoup(html, 'lxml')
    ...
    result.append(...)
Explanation:
soup.select(".primary .child-pages a")
Using a CSS selector, I select the a nodes that are children of a node with the class "child-pages", which is itself a child of the node with the class "primary". There were two nodes with the class "child-pages", and I filtered the one that was under the node with the "primary" class.
[x.text for x in dropdown_list]
This is a list comprehension in Python. It means that I take all the elements of dropdown_list, keep only the text attribute of each, and return the results as a list.
You can then iterate over the links and append the data to a list (here "result").
I found this introduction to BeautifulSoup pretty good, though I haven't gone through its links: http://programminghistorian.org/lessons/intro-to-beautiful-soup
I would also recommend reading a book. For example this one: Web Scraping with Python: Collecting Data from the Modern Web

Is this way to get items from a tag which has 2 class attributes with BeautifulSoup correct?

I'd like to get items from a website with BeautifulSoup.
<div class="post item">
The target tag is this.
The tag's class attribute has two values separated by white space.
First, I wrote,
roots = soup.find_all("div", "post item")
But, it didn't work.
Then I wrote,
html.find_all("div", {'class':['post', 'item']})
I could get items with this, but I am not sure whether it is correct or not.
Is this code correct?
//// Additional ////
I am sorry,
html.find_all("div", {'class':['post', 'item']})
didn't work properly. It also extracts class="item".
And I had to write
soup.find_all("div", class_="post item")
with class_=, not class=. Although even this doesn't work for me... (>_<)
Target url:
https://flipboard.com/section/%E3%83%8B%E3%83%A5%E3%83%BC%E3%82%B9-3uscfrirj50pdtqb
mycode:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
from urllib.request import urlopen
from bs4 import BeautifulSoup

def main():
    target = "https://flipboard.com/section/%E3%83%8B%E3%83%A5%E3%83%BC%E3%82%B9-3uscfrirj50pdtqb"
    html = urlopen(target)
    soup = BeautifulSoup(html, "html.parser")
    roots = soup.find_all("div", class_="post item")
    print(roots)
    for root in roots:
        print("##################")

if __name__ == '__main__':
    main()
You could use a CSS select:
soup.select("div.post.item")
Or use class_
.find_all("div", class_="post item")
The docs suggest that if you want to search for tags that match two or more CSS classes, you should use a CSS selector, as per the first example.
They give examples of both uses:
You can also search for the exact string value of the class attribute:
css_soup.find_all("p", class_="body strikeout")
# [<p class="body strikeout"></p>]
If you want to search for tags that match two or more CSS classes, you should use a CSS selector:
css_soup.select("p.strikeout.body")
# [<p class="body strikeout"></p>]
The reason your code fails, and why any of the above solutions would also fail, has more to do with the fact that the class does not exist in the source; if it were there, they would all work:
In [6]: r = requests.get("https://flipboard.com/section/%E3%83%8B%E3%83%A5%E3%83%BC%E3%82%B9-3uscfrirj50pdtqb")
In [7]: cont = r.content
In [8]: "post item" in cont
Out[8]: False
If you look at the browser source and do a search you won't find it either. It is generated dynamically and can only be seen if you crack open a developer console or Firebug. The divs also only contain some styling and React ids, so I'm not sure what you expect to pull from them even if you did get them.
If you want to get the HTML that you see in the browser, you will need something like selenium.
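A minimal sketch of that approach, assuming Chrome, and assuming the post item classes actually appear in the rendered page (which, per the above, I haven't verified):
from selenium import webdriver
from bs4 import BeautifulSoup

driver = webdriver.Chrome()
driver.get("https://flipboard.com/section/%E3%83%8B%E3%83%A5%E3%83%BC%E3%82%B9-3uscfrirj50pdtqb")

# hand the JS-rendered source to BeautifulSoup instead of the raw response
soup = BeautifulSoup(driver.page_source, "html.parser")
print(soup.select("div.post.item"))
driver.quit()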
First of all, note that class is a very special multi-valued attribute and it is a common source of confusion in BeautifulSoup.
html.find_all("div", {'class':['post', 'item']})
This would find all div elements that have either post class or item class (or both, of course). This may produce extra results you don't want to see, assuming you are after div elements with strictly class="post item". If this is the case, you can use a CSS selector:
html.select('div[class="post item"]')
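A self-contained toy example of the difference between the three approaches (made-up HTML, not the Flipboard page):
from bs4 import BeautifulSoup

html = '<div class="post item">A</div><div class="item">B</div><div class="item post">C</div>'
soup = BeautifulSoup(html, 'html.parser')

print(soup.find_all("div", {'class': ['post', 'item']}))  # either class: A, B and C
print(soup.find_all("div", class_="post item"))           # exact attribute string: A only
print(soup.select("div.post.item"))                       # both classes, any order: A and C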
There is also some more information in a similar thread:
BeautifulSoup returns empty list when searching by compound class names
