Extract 2 arguments from web page - python

I want to extract two attributes (title and href) from the <a> tags of a Wikipedia page.
I want output like this, e.g. for https://en.wikipedia.org/wiki/Riddley_Walker:
Canterbury Cathedral
/wiki/Canterbury_Cathedral
The code:
import os, re, lxml.html, urllib

def extractplaces(hlink):
    connection = urllib.urlopen(hlink)
    places = {}
    dom = lxml.html.fromstring(connection.read())
    for name in dom.xpath('//a/@title'):  # select the title attribute of all <a> tags (links)
        print name
In this case I only get the title attribute.

You should select the <a> elements that have a title attribute (instead of selecting the title attribute directly), and then use .attrib on each element to get the attributes you need. Example -
for name in dom.xpath('//a[@title]'):
    print('title :', name.attrib['title'])
    print('href :', name.attrib['href'])
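For reference, a minimal end-to-end sketch in Python 3 (the question's code is Python 2; urllib.request.urlopen replaces urllib.urlopen here, and the function name extractplaces is kept from the question):

import lxml.html
from urllib.request import urlopen

def extractplaces(hlink):
    # Fetch the page and parse it into an lxml document tree
    dom = lxml.html.fromstring(urlopen(hlink).read())
    # Pair each link's title attribute with its href (skip anchors without an href)
    return [(a.attrib['title'], a.get('href'))
            for a in dom.xpath('//a[@title]') if a.get('href')]

for title, href in extractplaces('https://en.wikipedia.org/wiki/Riddley_Walker'):
    print(title)
    print(href)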

Related

Getting weather for a country, place bs4

I'm trying to use the website https://www.timeanddate.com/weather/ to scrape weather data using BeautifulSoup4, opening a URL as:
quote_page = r"https://www.timeanddate.com/weather/%s/%s/ext" % (country, place)
I'm still new to web scraping and BS4. I can find the information I need in the page source (for example, take the country as India and the city as Mumbai in this search), linked as: https://www.timeanddate.com/weather/india/mumbai/ext
If you look at the page source, it's not difficult to use CTRL+F to find the attributes holding information like "Humidity", "Dew Point" and the current state of the weather (whether it's clear, rainy, etc.). The only thing preventing me from getting that data is my knowledge of BS4.
Can you inspect the page source and write the BS4 methods to get information like
"Feels Like:", "Visibility", "Dew Point", "Humidity", "Wind" and "Forecast"?
Note: I've done a data-scraping exercise before where I had to get the value in an HTML tag like <tag class="someclass">value</tag> using:
a = BeautifulSoup.find(tag, attrs={'class': 'someclass'})
a = a.text.strip()
You could familiarize yourself with CSS selectors:
import requests
from bs4 import BeautifulSoup as bs

country = 'india'
place = 'mumbai'
headers = {'User-Agent': 'Mozilla/5.0',
           'Host': 'www.timeanddate.com'}
quote_page = 'https://www.timeanddate.com/weather/{0}/{1}'.format(country, place)
res = requests.get(quote_page, headers=headers)  # pass the headers so they are actually used
soup = bs(res.content, 'lxml')
firstItem = soup.select_one('#qlook p:nth-of-type(2)')
strings = [string for string in firstItem.stripped_strings]
feelsLike = strings[0]
print(feelsLike)
quickFacts = [item.text for item in soup.select('#qfacts p')]
for fact in quickFacts:
    print(fact)
The first selector, #qlook p:nth-of-type(2), uses an id selector to specify the parent, then an :nth-of-type CSS pseudo-class to select the second paragraph element (p tag) within it. That selector matches the block holding the current conditions, including the "Feels Like" line; I use stripped_strings to separate out the individual lines and access the required info by index.
The second selector, #qfacts p, uses an id selector for the parent element and then a descendant combinator with a p type selector to match its child p tag elements, i.e. the quick-facts lines ("Dew Point", "Humidity", "Wind" and so on). quickFacts is a list of those matches; you can access items by index.
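If the pseudo-class and descendant combinator are new to you, here is a minimal, self-contained illustration (the HTML snippet is invented for demonstration, not taken from the real page):

from bs4 import BeautifulSoup

html = '''
<div id="qlook"><p>icon</p><p>Feels Like: 31 °C</p></div>
<div id="qfacts"><p>Humidity: 70%</p><p>Wind: 11 km/h</p></div>
'''
soup = BeautifulSoup(html, 'html.parser')
print(soup.select_one('#qlook p:nth-of-type(2)').text)  # second <p> inside #qlook
print([p.text for p in soup.select('#qfacts p')])       # every <p> under #qfacts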

Missing tags while using beautifulsoup

I'm using BeautifulSoup to parse an HTML file, but some tags are missed by the find_all() method. The HTML page is YARN-8569.
The code is here:
for tag in soup.find_all('div', class_='js-diff-progressive-container'):
    print 1
    for div in tag.find_all('div'):
        id = div.get('id')
        if id:
            id = id.split('-')
            print id
            if id[0] == 'diff':
                div2 = div.find_all('div')
                class_div = div2[0]
                if class_div.get('data-path'):
                    changed_class.append(class_div.get('data-path'))
However, I can only get into the first div tag with the class 'js-diff-progressive-container' and its child tags. For the second one I get a div whose class name is 'js-diff-progressive-retry' (which I can't find anywhere in the HTML file), and I can't get its child tags.
I use lxml as my HTML parser (the fix suggested by others), but it still doesn't work.

using lxml and requests in python to grab text between certain tags with a specific class name

I am trying to grab all text between a tag that has a specific class name. I believe I am very close to getting it right, so I think all it'll take is a simple fix.
These are the tags on the website I'm trying to retrieve data from; I want 'SNP'.
<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>
From what I have currently:
from lxml import html
import requests

def main():
    url_link = "http://finance.yahoo.com/q?s=^GSPC&d=t"
    page = html.fromstring(requests.get(url_link).text)
    for span_tag in page.xpath("//span"):
        class_name = span_tag.get("class")
        if class_name is not None:
            if "rtq_exch" == class_name:
                print(url_link, span_tag.text)

if __name__ == "__main__":
    main()
I get this:
http://finance.yahoo.com/q?s=^GSPC&d=t None
To show that it works, when I change this line:
if "rtq_dash" == class_name:
I get this (please note the '-' which is the same content between the tags):
http://finance.yahoo.com/q?s=^GSPC&d=t -
What I think is happening is it sees the child tag and stops grabbing the data, but I'm not sure why.
I would be happy with receiving
<span class="rtq_dash">-</span>SNP
as a string for span_tag.text, as I can easily chop off what I don't want.
At a higher level, I'm trying to get the stock symbol from the page.
Here is the documentation for requests, and here is the documentation for lxml (xpath).
I want to use xpath instead of BeautifulSoup for several reasons, so please don't suggest changing to use that library instead, not that it'd be any easier anyway.
There are a few possible ways. Note that in lxml, span_tag.text is only the text that appears before the element's first child; here that is empty, and the 'SNP' part is stored as the inner span's tail. So you can find the outer span and return its direct-child text nodes:
>>> url_link = "http://finance.yahoo.com/q?s=^GSPC&d=t"
>>> page = html.fromstring(requests.get(url_link).text)
>>> for span_text in page.xpath("//span[@class='rtq_exch']/text()"):
...     print(span_text)
...
SNP
or find the inner span and get its tail:
>>> for span_tag in page.xpath("//span[@class='rtq_dash']"):
...     print(span_tag.tail)
...
SNP
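Since the live Yahoo page may change, both approaches can be checked offline against the static snippet from the question; a minimal sketch:

from lxml import html

# Static fragment copied from the question; the live page may differ.
fragment = '<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>'
page = html.fromstring(fragment)

# The outer span's .text is empty (its first child comes immediately),
# so 'SNP ' is reachable as a text() node of the outer span...
print(page.xpath("//span[@class='rtq_exch']/text()"))   # ['SNP ']
# ...or as the tail of the inner span.
print(page.xpath("//span[@class='rtq_dash']")[0].tail)  # 'SNP '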
Use BeautifulSoup:
import bs4

html = """<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>"""
soup = bs4.BeautifulSoup(html, "html.parser")
# .strings yields ['-', 'SNP ']; index 1 is the symbol text
snp = list(soup.find_all("span", class_="rtq_exch")[0].strings)[1]

Python/Xpath - how to scrape an href field

As an example, from the following page I would like to extract the team name
http://www.scoresandodds.com/grid_20150409.html
I tried:
from lxml import html
import requests

pageNBA = requests.get('http://www.scoresandodds.com/grid_20150409.html')
treeNBA = html.fromstring(pageNBA.text)
team = treeNBA.xpath('//a[@href="/statfeed/statfeed.php?page=nba/nbateam&teamid=CHICAGO&season="]/text()')
I think my problem is in the team line where I'm defining the location. How should I locate an href?
You can use the following XPath selector expression:
//td[@class='name']/a
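Plugged into the question's code, a hedged sketch (it assumes, per the answer, that the team links sit inside <td class="name"> cells; the page layout may have changed since):

from lxml import html
import requests

pageNBA = requests.get('http://www.scoresandodds.com/grid_20150409.html')
treeNBA = html.fromstring(pageNBA.text)
# Take the text of the <a> inside every <td class="name"> cell
teams = treeNBA.xpath("//td[@class='name']/a/text()")
for team in teams:
    print(team)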

Navigation with BeautifulSoup

I am slightly confused about how to use BeautifulSoup to navigate the HTML tree.
import requests
from bs4 import BeautifulSoup

url = 'http://examplewebsite.com'
source = requests.get(url)
content = source.content
soup = BeautifulSoup(source.content, "html.parser")

# Now I navigate the soup
for a in soup.findAll('a'):
    print a.get("href")
Is there a way to find only particular hrefs by their labels? For example, all the hrefs I want are called by a certain name, e.g. price in an online catalog.
The href links I want are all in a certain location within the webpage, inside a particular element with a certain class. Can I access only these links?
How can I scrape the contents behind each href link and save them to a file?
With BeautifulSoup, that's all doable and simple.
(1) Is there a way to find only particular hrefs by their labels? For example, all the hrefs I want are called by a certain name, e.g. price in an online catalog.
Say, all the links you need have price in the text - you can use a text argument:
soup.find_all("a", text="price") # text equals to 'price' exactly
soup.find_all("a", text=lambda text: text and "price" in text) # 'price' is inside the text
Yes, you may use functions and many other kinds of objects to filter elements, for example compiled regular expressions:
import re
soup.find_all("a", text=re.compile(r"^[pP]rice"))
If price is somewhere in the "href" attribute, you can use the following CSS selectors:
soup.select("a[href*=price]") # href contains 'price'
soup.select("a[href^=price]") # href starts with 'price'
soup.select("a[href$=price]") # href ends with 'price'
or, via find_all():
soup.find_all("a", href=lambda href: href and "price" in href)
(2) The href links I want are all in a certain location within the webpage, inside a particular element with a certain class. Can I access only these links?
Sure, locate the appropriate container and call find_all() or other searching methods on it:
container = soup.find("div", class_="container")
for link in container.select("a[href*=price]"):
    print(link["href"])
Or, you may write your CSS selector so that it searches for links inside a specific element having the desired attribute or attribute values. For example, here we search for a elements having href attributes located inside a div element having the container class:
soup.select("div.container a[href]")
(3) How can I scrape the contents behind each href link and save them to a file?
If I understand correctly, you need to get appropriate links, follow them and save the source code of the pages locally into HTML files. There are multiple options to choose from depending on your requirements (for instance, speed may be critical. Or, it's just a one-time task and you don't care about performance).
If you stay with requests, the code will be of a blocking nature: you'll extract a link, follow it, save the page source, and then proceed to the next one. The main downside is that it will be slow (depending on, for starters, how many links there are). Sample code to get you going:
from urlparse import urljoin  # urllib.parse.urljoin in Python 3
from bs4 import BeautifulSoup
import requests

base_url = 'http://examplewebsite.com'
with requests.Session() as session:  # maintaining a web-scraping session
    soup = BeautifulSoup(session.get(base_url).content, "html.parser")
    for link in soup.select("div.container a[href]"):
        full_link = urljoin(base_url, link["href"])
        title = link.get_text(strip=True)  # was 'a', which is undefined here
        with open(title + ".html", "w") as f:
            f.write(session.get(full_link).content)
You may look into grequests or Scrapy to solve that part.
