Python/XPath - how to scrape an href field

As an example, from the following page I would like to extract the team names:
http://www.scoresandodds.com/grid_20150409.html
I tried:
from lxml import html
import requests

pageNBA = requests.get('http://www.scoresandodds.com/grid_20150409.html')
treeNBA = html.fromstring(pageNBA.text)
team = treeNBA.xpath('//a[@href="/statfeed/statfeed.php?page=nba/nbateam&teamid=CHICAGO&season="]/text()')
I think my problem is in the team line, where I define the location. How should I locate an element by its href attribute?

You can use XPath as follows (note that XPath attribute tests use @, not #):
//td[@class='name']/a
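A minimal, self-contained sketch putting that selector together with your requests/lxml code; the td[@class='name'] structure is assumed from the page's markup at the time:
from lxml import html
import requests

# Fetch the grid page and parse it with lxml
pageNBA = requests.get('http://www.scoresandodds.com/grid_20150409.html')
treeNBA = html.fromstring(pageNBA.text)

# Team names are the link texts inside the <td class="name"> cells
# (structure assumed; verify against the live page)
teams = treeNBA.xpath("//td[@class='name']/a/text()")
print(teams)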

Related

Getting weather for a country/place with bs4

I'm trying to use this website https://www.timeanddate.com/weather/ to scrape data of the weather using BeautifulSoup4 by opening a URL as:
quote_page=r"https://www.timeanddate.com/weather/%s/%s/ext" %(country, place)
I'm still new to web scraping methods and BS4, but I can find the information I need in the source of the page (for example, take the country as India and the city as Mumbai in this search), linked as: https://www.timeanddate.com/weather/india/mumbai/ext
If you look at the page's source, it is not difficult to use CTRL+F and find the attributes of the information like "Humidity", "Dew Point" and the current state of the weather (clear, rainy, etc.); the only thing preventing me from getting that data is my knowledge of BS4.
Can you inspect the page source and write the BS4 methods to get information like
"Feels Like:", "Visibility", "Dew Point", "Humidity", "Wind" and "Forecast"?
Note: I've done a data scraping exercise before where I had to get the value in an HTML tag like <tag class="someclass">value</tag> using
a = soup.find(tag, attrs={'class': 'someclass'})
a = a.text.strip()
You could familiarize yourself with CSS selectors.
import requests
from bs4 import BeautifulSoup as bs

country = 'india'
place = 'mumbai'
headers = {'User-Agent': 'Mozilla/5.0',
           'Host': 'www.timeanddate.com'}
quote_page = 'https://www.timeanddate.com/weather/{0}/{1}'.format(country, place)
res = requests.get(quote_page, headers=headers)
soup = bs(res.content, 'lxml')
firstItem = soup.select_one('#qlook p:nth-of-type(2)')
strings = [string for string in firstItem.stripped_strings]
feelsLike = strings[0]
print(feelsLike)
quickFacts = [item.text for item in soup.select('#qfacts p')]
for fact in quickFacts:
    print(fact)
The first selector, #qlook p:nth-of-type(2), uses an id selector to specify the parent and then an :nth-of-type CSS pseudo-class to select the second paragraph element (p tag) within it; that is the block containing the "Feels Like" reading.
I use stripped_strings to separate out the individual lines and access the required info by index.
The second selector, #qfacts p, uses an id selector for the parent element and then a descendant combinator with a p type selector to match the child p tags - the quick-fact lines such as "Humidity", "Dew Point" and "Wind".
quickFacts is the list of those matches; you can access individual items by index.
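If you'd rather look facts up by label than by position, a small follow-on sketch; the "Label: value" shape of each line is an assumption based on what the loop above prints:
# Build a dict from the "Label: value" lines; splitting on the first
# colon is an assumption about how each fact is formatted.
facts = {}
for fact in quickFacts:
    label, _, value = fact.partition(':')
    facts[label.strip()] = value.strip()

print(facts.get('Humidity'))
print(facts.get('Dew Point'))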

extracting text from css node scrapy

I'm trying to scrape a catalog id number from this page:
from scrapy.selector import Selector
from scrapy.http import HtmlResponse
url = 'http://www.enciclovida.mx/busquedas/resultados?utf8=%E2%9C%93&busqueda=basica&id=&nombre=astomiopsis+exserta&button='
response = HtmlResponse(url=url)
using the css selector (which works in R with rvest::html_nodes)
".result-nombre-container > h5:nth-child(2) > a:nth-child(1)"
I would like to retrieve the catalog id, which in this case should be:
6011038
I'm OK with doing it via XPath if that's easier.
I don't have Scrapy here, but I tested this XPath and it will get you the href:
//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href
If you're having too much trouble with Scrapy and CSS selector syntax, I would also suggest trying out the BeautifulSoup Python package. With BeautifulSoup you can do things like
link.get('href')
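For example, a minimal BeautifulSoup sketch of the same extraction, with requests standing in for Scrapy's downloader; the container class comes from the selector in the question:
import re
import requests
from bs4 import BeautifulSoup

url = ('http://www.enciclovida.mx/busquedas/resultados?utf8=%E2%9C%93'
       '&busqueda=basica&id=&nombre=astomiopsis+exserta&button=')
soup = BeautifulSoup(requests.get(url).content, 'lxml')

# Inside the result container, take the first link whose href ends in
# digits; that trailing number is the catalog id (6011038 here)
container = soup.find('div', class_='result-nombre-container')
for link in container.find_all('a', href=True):
    match = re.search(r'(\d+)$', link['href'])
    if match:
        print(match.group(1))
        break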
If you need to parse the id from the href:
catalog_id = response.xpath("//div[contains(@class, 'result-nombre-container')]/h5[2]/a/@href").re_first(r'(\d+)$')
There seems to be only one link in the h5 element, so in short:
response.css('h5 > a::attr(href)').re(r'(\d+)$')
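Both expressions can also be tested without a full spider by building the selector from a plain requests download, e.g. with parsel, the library Scrapy's selectors are built on. Note that the question's HtmlResponse(url=url) alone never fetches the page body, so the selectors need real HTML to work against. A minimal sketch:
import requests
from parsel import Selector

url = ('http://www.enciclovida.mx/busquedas/resultados?utf8=%E2%9C%93'
       '&busqueda=basica&id=&nombre=astomiopsis+exserta&button=')
sel = Selector(text=requests.get(url).text)

# Same expressions as above, run against the downloaded HTML
print(sel.xpath("//div[contains(@class, 'result-nombre-container')]"
                "/h5[2]/a/@href").re_first(r'(\d+)$'))
print(sel.css('h5 > a::attr(href)').re(r'(\d+)$'))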

Python scrape table

I'm new to programming, so it's very likely my approach is totally not the way to do this.
I'm trying to scrape the standings table from this site - http://www.flashscore.com/hockey/finland/liiga/ - for now it would be fine if I could even scrape one column with team names. I try to find td tags with the class "participant_name col_participant_name col_name", but the code returns empty brackets:
import requests
from bs4 import BeautifulSoup
import lxml

def table(url):
    teams = []
    source = requests.get(url).content
    soup = BeautifulSoup(source, "lxml")
    for td in soup.find_all("td"):
        team = td.find_all("participant_name col_participant_name col_name")
        teams.append(team)
    print(teams)

table("http://www.flashscore.com/hockey/finland/liiga/")
I tried using tr tag to retrieve whole rows, but no success either.
I think the main problem here is that you are trying to scrape dynamically generated content using requests. Note that there's no "participant_name col_participant_name col_name" text at all in the HTML source of the page, which means it is generated with JavaScript by the website. For that job you should use something like selenium together with ChromeDriver (or whichever driver you prefer). Below is an example using both of the mentioned tools:
from bs4 import BeautifulSoup
from selenium import webdriver
url = "http://www.flashscore.com/hockey/finland/liiga/"
driver = webdriver.Chrome()
driver.get(url)
source = driver.page_source
soup = BeautifulSoup(source, "lxml")
elements = soup.findAll('td', {'class':"participant_name col_participant_name col_name"})
I think another issue with your code is the way you were trying to access the tags: if you want to match a specific class or any other specific attribute, you can do so by passing a Python dictionary as an argument to the .findAll function.
Now we can use elements to find all the teams' names. Try print(elements[0]) and notice that the team's name is inside an a tag, so we can access it using .a.text, something like this:
teams = []
for item in elements:
    team = item.a.text
    print(team)
    teams.append(team)
print(teams)
teams now should be the desired output:
>>> teams
['Assat', 'Hameenlinna', 'IFK Helsinki', 'Ilves', 'Jyvaskyla', 'KalPa', 'Lukko', 'Pelicans', 'SaiPa', 'Tappara', 'TPS Turku', 'Karpat', 'KooKoo', 'Vaasan Sport', 'Jukurit']
teams could also be created using a list comprehension:
teams = [item.a.text for item in elements]
Mr Aguiar beat me to it! I will just point out that you can do it all with selenium alone. Of course he is correct in pointing out that this is one of the many sites that loads most of its content dynamically.
You might be interested in observing that I have used an XPath expression. These often make for compact ways of saying what you want, and they're not too hard to read once you get used to them.
>>> from selenium import webdriver
>>> driver = webdriver.Chrome()
>>> driver.get('http://www.flashscore.com/hockey/finland/liiga/')
>>> items = driver.find_elements_by_xpath('.//span[@class="team_name_span"]/a[text()]')
>>> for item in items:
...     item.text
...
'Assat'
'Hameenlinna'
'IFK Helsinki'
'Ilves'
'Jyvaskyla'
'KalPa'
'Lukko'
'Pelicans'
'SaiPa'
'Tappara'
'TPS Turku'
'Karpat'
'KooKoo'
'Vaasan Sport'
'Jukurit'
You're very close.
Start out being a little less ambitious, and just focus on "participant_name". Take a look at https://www.crummy.com/software/BeautifulSoup/bs4/doc/#find-all . I think you want something like:
for td in soup.find_all("td", "participant_name"):
    print(td.text)
Also, you must be seeing different web content than I am. After a wget of your URL, grep doesn't find "participant_name" in the text at all. You'll want to verify that your code is looking for an ID or a class that is actually present in the HTML text.
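The same wget/grep check can be done from Python, to confirm whether a class name is actually present in the raw HTML before blaming the parsing code:
import requests

url = "http://www.flashscore.com/hockey/finland/liiga/"
html = requests.get(url).text

# If this prints False, the class is injected later by JavaScript,
# so requests alone will never see it - hence the empty results above
print("participant_name" in html)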
Achieving the same using a CSS selector, which lets you make the code more readable and concise:
from selenium import webdriver

driver = webdriver.Chrome()
driver.get('http://www.flashscore.com/hockey/finland/liiga/')
for team_name in driver.find_elements_by_css_selector('.participant_name'):
    print(team_name.text)
driver.quit()

Scrape a class within a class

I want to scrape the href within the class_="_e4b". Basically I'm looking to scrape a class within a class using BeautifulSoup.
from bs4 import BeautifulSoup
import selenium.webdriver as webdriver
url = ("https://www.google.com/search?...")
def get_related_search(url):
    driver = webdriver.Chrome("C:\\Users\\John\\bin\\chromedriver.exe")
    driver.get(url)
    soup = BeautifulSoup(driver.page_source)
    relate_result = soup.find_all("p", class_="_e4b")
    return relate_result[0]

relate_url = get_related_search(url)
print(relate_url)
Results (the stray "markup_type=markup_type))" text is the tail of BeautifulSoup's "no parser specified" warning, emitted because BeautifulSoup(driver.page_source) doesn't name a parser):
<p class="_e4b"><a href="/search?...">...</a></p>
I now want to scrape the href result. I am not sure what the next step would be. Thanks for the help.
You can actually find this inner a element in one go with a CSS selector:
links = soup.select("p._e4b a[href]")
for link in links:
    print(link['href'])
p._e4b a[href] would locate all a elements having the href attribute inside the p elements having _e4b class.
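Plugged into the function from the question, that looks roughly like this; the _e4b class and the chromedriver path are taken from the question and may differ for you:
def get_related_hrefs(url):
    driver = webdriver.Chrome("C:\\Users\\John\\bin\\chromedriver.exe")
    driver.get(url)
    # Naming a parser here also silences the markup_type warning
    soup = BeautifulSoup(driver.page_source, "lxml")
    driver.quit()
    return [link['href'] for link in soup.select("p._e4b a[href]")]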

Extract 2 arguments from web page

I want to extract 2 attributes (title and href) from <a> tags on a Wikipedia page.
I want this output eg (https://en.wikipedia.org/wiki/Riddley_Walker):
Canterbury Cathedral
/wiki/Canterbury_Cathedral
The code:
import os, re, lxml.html, urllib

def extractplaces(hlink):
    connection = urllib.urlopen(hlink)
    places = {}
    dom = lxml.html.fromstring(connection.read())
    for name in dom.xpath('//a/@title'):  # select the title attribute of all a tags (links)
        print name
In this case I only get the title values, not the hrefs.
You should get the a elements that have a title attribute (instead of directly getting the title attribute), and then use .attrib on each element to fetch the attributes you need. Example -
for name in dom.xpath('//a[@title]'):
    print('title :', name.attrib['title'])
    print('href :', name.attrib['href'])
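As a complete script, a sketch in Python 3 (the question's urllib.urlopen is Python 2; urllib.request.urlopen is its Python 3 counterpart), printing the pairs the question asks for:
import lxml.html
from urllib.request import urlopen

def extractplaces(hlink):
    dom = lxml.html.fromstring(urlopen(hlink).read())
    # Keep only anchors carrying both attributes we want
    for a in dom.xpath('//a[@title and @href]'):
        print(a.attrib['title'])
        print(a.attrib['href'])

extractplaces('https://en.wikipedia.org/wiki/Riddley_Walker')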
