Python scrape table

I'm new to programming, so it's very likely my whole approach to this is wrong.
I'm trying to scrape the standings table from this site - http://www.flashscore.com/hockey/finland/liiga/ - for now it would be fine if I could even scrape one column with the team names, so I try to find td tags with the class "participant_name col_participant_name col_name", but the code returns empty brackets:
import requests
from bs4 import BeautifulSoup
import lxml

def table(url):
    teams = []
    source = requests.get(url).content
    soup = BeautifulSoup(source, "lxml")
    for td in soup.find_all("td"):
        team = td.find_all("participant_name col_participant_name col_name")
        teams.append(team)
    print(teams)

table("http://www.flashscore.com/hockey/finland/liiga/")
I tried using the tr tag to retrieve whole rows, but had no success either.

I think the main problem here is that you are trying to scrape dynamically generated content using requests. Note that "participant_name col_participant_name col_name" does not appear anywhere in the HTML source of the page, which means it is generated with JavaScript by the website. For that job you should use something like selenium together with ChromeDriver, or whichever driver you prefer. Below is an example using both of the mentioned tools:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "http://www.flashscore.com/hockey/finland/liiga/"
driver = webdriver.Chrome()
driver.get(url)
source = driver.page_source
soup = BeautifulSoup(source, "lxml")
elements = soup.findAll('td', {'class':"participant_name col_participant_name col_name"})
I think another issue with your code is the way you were trying to access the tags: if you want to match a specific class, or any other specific attribute, you can pass a Python dictionary as an argument to .findAll.
Now we can use elements to find all the team names. Try print(elements[0]) and notice that the team's name is inside an a tag, so we can access it using .a.text - something like this:
teams = []
for item in elements:
    team = item.a.text
    print(team)
    teams.append(team)
print(teams)
teams should now hold the desired output:
>>> teams
['Assat', 'Hameenlinna', 'IFK Helsinki', 'Ilves', 'Jyvaskyla', 'KalPa', 'Lukko', 'Pelicans', 'SaiPa', 'Tappara', 'TPS Turku', 'Karpat', 'KooKoo', 'Vaasan Sport', 'Jukurit']
teams could also be created using list comprehension:
teams = [item.a.text for item in elements]

Mr Aguiar beat me to it! I will just point out that you can do it all with selenium alone. Of course he is correct in pointing out that this is one of the many sites that load most of their content dynamically.
You might be interested in observing that I have used an XPath expression. These often make for a compact way of saying what you want, and they are not too hard to read once you get used to them.
>>> from selenium import webdriver
>>> driver = webdriver.Chrome()
>>> driver.get('http://www.flashscore.com/hockey/finland/liiga/')
>>> items = driver.find_elements_by_xpath('.//span[@class="team_name_span"]/a[text()]')
>>> for item in items:
...     item.text
...
'Assat'
'Hameenlinna'
'IFK Helsinki'
'Ilves'
'Jyvaskyla'
'KalPa'
'Lukko'
'Pelicans'
'SaiPa'
'Tappara'
'TPS Turku'
'Karpat'
'KooKoo'
'Vaasan Sport'
'Jukurit'

You're very close.
Start out being a little less ambitious, and just focus on "participant_name". Take a look at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all. I think you want something like:
for td in soup.find_all("td", "participant_name"):
Also, you must be seeing different web content than I am. After a wget of your URL, grep doesn't find "participant_name" in the text at all. You'll want to verify that your code is looking for an ID or a class that is actually present in the HTML text.
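For example, a quick sanity check along those lines (a minimal sketch using requests; the substring is the class from your code):
import requests

# Fetch the raw HTML exactly as requests sees it (no JavaScript executed)
html = requests.get("http://www.flashscore.com/hockey/finland/liiga/").text

# If this prints False, the class is only added later by JavaScript,
# and requests alone will never see it.
print("participant_name" in html)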

You can achieve the same with a CSS selector, which lets you make the code more readable and concise:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.flashscore.com/hockey/finland/liiga/')
for player_name in driver.find_elements_by_css_selector('.participant_name'):
    print(player_name.text)
driver.quit()

Related

Python (with selenium) how to modify and activate elements to update webpage before scraping data

I am trying to scrape some data from https://marvelsnapzone.com/decks/, but I would like to modify the table of decks before scraping it. For example:
Adding card names:
I am trying to add a new div id="tagsblock" with certain names like class="tag card" "Angela"
Executing the "Search":
I would then like to execute the id="searchdecks" command to update the table of decks.
Sorting by ascending "Likes":
Lastly, I want to edit the span data-sorttype="likes" class so that it reads span data-sorttype="likes" class="asc".
Below is my current Python script, which doesn't seem to sort the "Likes" before scraping the deck info. It also currently does not add cards or execute the "Search".
import re
import requests
import os
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

def scrap():
    url = 'https://marvelsnapzone.com/decks'
    chrome_options = Options()
    chrome_options.headless = True
    chrome_options.add_argument('--disable-dev-shm-usage')
    chrome_options.add_argument('--disable-extensions')
    chrome_options.add_argument('--disable-gpu')
    browser = webdriver.Chrome(options=chrome_options)
    browser.get(url)
    html = browser.page_source
    soup = BeautifulSoup(html, 'html.parser')

    # I would like to add cards and execute the "Search" option here
    selects = soup.findAll('span', {'data-sorttype': 'likes'})
    for select in selects:
        browser.execute_script("arguments[0].setAttribute('class', 'asc')", select)
        # this does not seem to sort the table, based on the data scraped later

    links = soup.findAll('a', {'class': 'card cardtooltip maindeckcard tooltiploaded'})
    # ... more web-scraping code ...
    # I am able to scrape the information after this, but I am struggling to
    # modify the table before scraping the information.

if __name__ == '__main__':
    characters = scrap()
Usually such sites are dynamic, and so load new data via a script when you click on a button. This means that in these cases, if you set an attribute with selenium, the site will not change.
That said, your code has some errors which I think come from assuming that selenium and beautifulsoup talk to each other (i.e. interact).
By doing this
soup = BeautifulSoup(...)
browser.execute_script(...)
links = soup.findAll(...)
you are trying to "update" soup by executing a script, but it doesn't work like that, in fact soup is an immutable object. So when you run soup.findAll(...) you are using an "old" soup which doesn't contain the modifications following from browser.execute_script(...).
By doing this
browser.execute_script("arguments[0].setAttribute('class', 'asc')", select)
you are trying to use selenium to set an attribute on an object found with beautifulsoup. You cannot do this. The correct way is to find the element with selenium:
from selenium.webdriver.common.by import By

select = browser.find_element(By.CSS_SELECTOR, '[data-sorttype=likes]')
browser.execute_script("arguments[0].setAttribute('class', 'asc')", select)
Anyway this doesn't work because as I said in the beginning, if you set an attribute with selenium the site will not change.
In this code block
selects = soup.findAll('span', {'data-sorttype': 'likes'})
for select in selects:
    # do something with select
why loop if you just want to set one attribute? Use soup.find (which returns a single Tag) instead of soup.findAll (which returns a list, in this case a list with only one element):
select = soup.find('span', {'data-sorttype': 'likes'})
# do something with select
So the correct sequence of commands to sort the table and then scrape it with beautifulsoup is the following:
import time

browser.get(url)

# click on the "Likes" button
select = browser.find_element(By.CSS_SELECTOR, '[data-sorttype=likes]')
select.click()
time.sleep(2)  # wait to be sure that the table is sorted

html = browser.page_source
soup = BeautifulSoup(html, 'html.parser')
links = soup.findAll('a', {'class': 'card cardtooltip maindeckcard tooltiploaded'})
Notice that beautifulsoup is not mandatory for scraping the page; you can use selenium alone.
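For instance, a sketch of the selenium-only route, reusing the browser session from above (the class string is the one the question targets and may differ on the live site):
from selenium.webdriver.common.by import By

cards = browser.find_elements(By.CSS_SELECTOR, 'a.card.cardtooltip.maindeckcard.tooltiploaded')
for card in cards:
    # Read the link target and visible text straight from selenium,
    # with no beautifulsoup involved.
    print(card.get_attribute('href'), card.text)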

Can't find class in JavaScript-rendered page using beautifulsoup, selenium, and python

When parsing the following webpage:
https://resultados.registraduria.gov.co/coalicion/0/colombia
only the div "root" is shown, with no other classes. I'm trying to retrieve the names of the candidates under the "FilaTablaPartidos__NombreCandidatoConsulta-sc-1s2q8ec-8 uMrIO" class, but when I try to use the find_all() method from beautiful soup it returns an empty list.
This is the code I'm currently using:
url = 'https://resultados.registraduria.gov.co/coalicion/0/colombia'
driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.get(url)

stages = driver.find_elements_by_class_name(
    'FilaTablaPartidos__NombreCandidatoConsulta-sc-1s2q8ec-8 uMrIO')

soup = BeautifulSoup(driver.page_source, 'html.parser')
nombres_ptags = soup.find_all('p', {'class':
    'FilaTablaPartidos__NombreCandidatoConsulta-sc-1s2q8ec-8 uMrIO'})

nombres = []
for name in nombres_ptags:
    nombres.append(name.text)

driver.close()
I even tried the driver method find_elements_by_class_name, but it also returns an empty list.
I think the children of the tree aren't rendered yet when the HTML is parsed, and that is why it's not finding the class I need.
Any help is appreciated, thank you!
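One likely fix, as a hedged sketch rather than a confirmed solution: the page is rendered client-side, so wait explicitly for the candidate elements before parsing. Note that a compound class must be written as a CSS selector (dots, no spaces), and that styled-components hashes like sc-1s2q8ec-8 can change between site builds:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome('/usr/local/bin/chromedriver')
driver.get('https://resultados.registraduria.gov.co/coalicion/0/colombia')

# Wait up to 20 s for the client-side app to render the candidate names.
# find_elements_by_class_name cannot take a string containing spaces,
# so the two classes are joined with dots in a CSS selector instead.
nombres_ptags = WebDriverWait(driver, 20).until(
    EC.presence_of_all_elements_located((
        By.CSS_SELECTOR,
        'p.FilaTablaPartidos__NombreCandidatoConsulta-sc-1s2q8ec-8.uMrIO')))

nombres = [p.text for p in nombres_ptags]
driver.close()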

Python: Search multiple HTML files in a variable

I have used the Selenium driver to crawl through many site pages. Every time I get a new page I append its HTML to a variable called "All_APP_Pages", so All_APP_Pages holds the HTML for many pages. I did not post that code because it is long and not relevant to the issue. Python lists All_APP_Pages as being of type bytes.
from lxml import html
from lxml import etree
import xml.etree.ElementTree as ET
from selenium.webdriver.common.by import By
dom = etree.HTML(All_APP_Pages)
xp = "//tr[.//span[contains(.,'Product Data Solutions (UHC MR)')] and .//td[contains(.,'SQLServer')] and .//td[contains(.,'MR')]]//a"
link = dom.xpath(xp)
print(link)
Once all pages have been scanned I need to get the link from this xpath:
"//tr[.//span[contains(.,'Product Data Solutions (ABC MR)')] and .//td[contains(.,'SQLServer')] and .//td[contains(.,'MR')]]//a"
The xpath listed here works, but only with the selenium driver, and only if the driver is on the page where the link exists. That is why all pages are in one variable: I don't know which page the link will be on. The print shows this result:
[<Element a at 0x1c39dea1180>]
How do I get this value from link so I can check whether the value is correct?
You need to iterate the list and get the href value
dom = etree.HTML(All_APP_Pages)
xp = "//tr[.//span[contains(.,'Product Data Solutions (UHC MR)')] and .//td[contains(.,'SQLServer')] and .//td[contains(.,'MR')]]//a"
link = dom.xpath(xp)
hrefs = [l.attrib["href"] for l in link]
print(hrefs)
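If you also want the link text next to each href (an illustrative addition; it assumes the matched a elements have text nodes):
# Each entry in link is an lxml Element: .attrib exposes its attributes
# as a dict, and .text is its first text node (may be None).
for a in link:
    print(a.attrib.get("href"), a.text)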

Get dynamically generated content with python Selenium

This question has been asked before, but I've searched and tried and still can't get it to work. I'm a beginner when it comes to Selenium.
Have a look at: https://finance.yahoo.com/quote/FB
I'm trying to web scrape the "Recommended Rating", which in this case at the time of writing is 2. I've tried:
driver.get('https://finance.yahoo.com/quote/FB')
time.sleep(10)
rating = driver.find_element_by_css_selector('#Col2-4-QuoteModule-Proxy > div > section > div > div > div')
print(rating.text)
...which doesn't give me an error, but doesn't print any text either. I've also tried with xpath, class_name, etc. Instead I tried:
source = driver.page_source
print(source)
This doesn't work either; I'm just getting the actual source without the dynamically generated content. When I click "View Source" in Chrome, it's not there. I also tried saving the webpage in Chrome. That didn't work.
Then I discovered that if I save the entire webpage, including images, CSS files and everything, the source code is different from the one I get when I just save the HTML.
The HTML file I get when I save the entire webpage using Chrome DOES contain the information that I need, and at first I was thinking about using pyautogui to just Ctrl + S every webpage, but there must be another way.
The information that I need is obviously there in the HTML, but how do I get it without downloading the entire web page?
Try this to retrieve the dynamically generated content (the DOM after the page's JavaScript has executed):
driver.execute_script("return document.body.innerHTML")
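For example, a minimal sketch that feeds the rendered DOM into BeautifulSoup, reusing the driver from the question (div.rating-text is the selector mentioned in a later answer and is an assumption about Yahoo's current markup):
from bs4 import BeautifulSoup

# innerHTML of <body> after the page's JavaScript has run
rendered = driver.execute_script("return document.body.innerHTML")
soup = BeautifulSoup(rendered, "html.parser")

rating = soup.select_one("div.rating-text")
if rating is not None:
    print(rating.text)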
See similar question:
Running javascript in Selenium using Python
The CSS selector, div.rating-text, is working just fine and is unique on the page. Returning .text will give you the value you are looking for.
First, you need to wait for the element to be clickable, then make sure you scroll down to the element before getting the rating. Try
element.location_once_scrolled_into_view
element.text
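Put together, that flow might look like the following sketch (the 10-second timeout is an assumption; div.rating-text is the selector named above):
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

wait = WebDriverWait(driver, 10)
element = wait.until(
    EC.element_to_be_clickable((By.CSS_SELECTOR, "div.rating-text")))

# Accessing this property scrolls the element into view as a side effect
element.location_once_scrolled_into_view
print(element.text)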
EDIT:
Use the following XPath selector:
'//a[@data-test="recommendation-rating-header"]//following-sibling::div//div[@class="rating-text Arrow South Fw(b) Bgc($buy) Bdtc($buy)"]'
Then you will have:
rating = driver.find_element_by_xpath('//a[@data-test="recommendation-rating-header"]//following-sibling::div//div[@class="rating-text Arrow South Fw(b) Bgc($buy) Bdtc($buy)"]')
To extract the value of the slider, use
val = rating.get_attribute("aria-label")
The script below answers a different question but somehow I think this is what you are after.
import requests
from bs4 import BeautifulSoup
base_url = 'http://finviz.com/screener.ashx?v=152&s=ta_topgainers&o=price&c=0,1,2,3,4,5,6,7,25,63,64,65,66,67'
html = requests.get(base_url)
soup = BeautifulSoup(html.content, "html.parser")
main_div = soup.find('div', attrs = {'id':'screener-content'})
light_rows = main_div.find_all('tr', class_="table-light-row-cp")
dark_rows = main_div.find_all('tr', class_="table-dark-row-cp")
data = []
for rows_set in (light_rows, dark_rows):
    for row in rows_set:
        row_data = []
        for cell in row.find_all('td'):
            val = cell.a.get_text()
            row_data.append(val)
        data.append(row_data)

# sort rows to restore the original table order (light and dark rows alternate)
data.sort(key=lambda x: int(x[0]))

import pandas
pandas.DataFrame(data).to_csv("AAA.csv", header=False)

Web Scraping with Python - Looping for city name, clicking and get interested value

This is my first time with Python and web scraping. I have been looking around and am still unable to get what I need done.
Below are screenshots of the elements that I've inspected via Chrome.
As you can see, it is from the dropdown 'Apartments'.
My 1st step is to get the list of cities from the drop-down.
My 2nd step is then, from the given city list, to go to each of them (...url.../Brantford/ for example).
My 3rd step is then, given the available apartments, to click each of the available apartments to get the price range for each bedroom type.
Currently, I am JUST trying to loop through the cities in the first step, and it's not working.
Could you please also point me to a good forum, article, tutorial, etc. for a beginner like me to read and learn from? I'd really like to get good at this so that I may give back to society one day.
Thank you!
import requests
from bs4 import BeautifulSoup
url = 'http://www.homestead.ca/apartments-for-rent/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html,'lxml')
dropdown_list = soup.find(".child-pages dropdown-menu a href")
print (dropdown_list.prettify())
Screenshot
You can access the elements by the class and a child "a" node, then read the "href" attribute and prepend the domain name.
import requests
from bs4 import BeautifulSoup
url = 'http://www.homestead.ca/apartments-for-rent/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html,'lxml')
dropdown_list = soup.select(".primary .child-pages a")
links=['http://www.homestead.ca'+x['href'] for x in dropdown_list]
print (links)
city_names=[x.text for x in dropdown_list]
print (city_names)
result = []
for link in links:
    response = requests.get(link)
    html = response.content
    soup = BeautifulSoup(html, 'lxml')
    ...
    result.append(...)
Explanation:
soup.select(".primary .child-pages a")
Using a CSS selector, I select the "a" nodes that are children of a node with the class "child-pages", which is in turn a child of the node with the class "primary". There were two nodes with the class "child-pages", and I filtered the one that was under the node with the "primary" class.
[x.text for x in dropdown_list]
This is a list comprehension in Python. It means that I take every element of dropdown_list, keep only its text attribute, and return the results as a list.
You can then iterate over the links and append the data to a list (here "result").
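As a concrete, if trivial, illustration of that loop, here is a sketch that just stores each city page's title (a placeholder for the real apartment and price extraction, which depends on page structure not shown here):
result = []
for link in links:
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'lxml')
    # Placeholder extraction: every page has a <title>; swap this for
    # the selectors that pull the apartment and price data you need.
    result.append(soup.title.text.strip())
print(result)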
I found this introduction to BeautifulSoup pretty good, even without going through its links: http://programminghistorian.org/lessons/intro-to-beautiful-soup
I would also recommend reading a book. For example this one: Web Scraping with Python: Collecting Data from the Modern Web
