I'm trying to use the website https://www.timeanddate.com/weather/ to scrape weather data with BeautifulSoup4, opening a URL built as:
quote_page=r"https://www.timeanddate.com/weather/%s/%s/ext" %(country, place)
I'm still new to web scraping methods and BS4, but I can find the information I need in the page source (for example, take the country as India and the city as Mumbai; the search links to https://www.timeanddate.com/weather/india/mumbai/ext).
If you look at the page's source, it is not difficult to CTRL+F your way to the attributes holding information like "Humidity", "Dew Point" and the current state of the weather (whether it's clear, rainy, etc.); the only thing preventing me from getting that data is my knowledge of BS4.
Can you inspect the page source and write the BS4 methods to get information like
"Feels Like:", "Visibility", "Dew Point", "Humidity", "Wind" and "Forecast"?
Note: I've done a data scraping exercise before where I had to get the value inside an HTML tag like <tag class="someclass">value</tag> using:
a = soup.find(tag, attrs={'class': 'someclass'})
a = a.text.strip()
You could familiarize yourself with CSS selectors:
import requests
from bs4 import BeautifulSoup as bs

country = 'india'
place = 'mumbai'
headers = {'User-Agent': 'Mozilla/5.0',
           'Host': 'www.timeanddate.com'}

quote_page = 'https://www.timeanddate.com/weather/{0}/{1}'.format(country, place)
res = requests.get(quote_page, headers=headers)
soup = bs(res.content, 'lxml')

# "Feels Like" is the first line of the second paragraph in the quick-look box
firstItem = soup.select_one('#qlook p:nth-of-type(2)')
strings = [string for string in firstItem.stripped_strings]
feelsLike = strings[0]
print(feelsLike)

# quick facts: Humidity, Dew Point, Visibility and so on
quickFacts = [item.text for item in soup.select('#qfacts p')]
for fact in quickFacts:
    print(fact)
The first selector, #qlook p:nth-of-type(2), uses an id selector to specify the parent and then the :nth-of-type CSS pseudo-class to select the second paragraph element (p tag) within it. That selector matches the block in the quick-look panel that holds the current conditions, including the "Feels Like" reading. I use stripped_strings to separate out the individual lines and access the required info by index.
The second selector, #qfacts p, uses an id selector for the parent element and then a descendant combinator with a p type selector to match the child p tags. That combination matches the quick-facts paragraphs (Humidity, Dew Point, Visibility and so on).
quickFacts is a list of those matches. You can access items by index.
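If you need the individual values rather than whole lines, you can split each quick fact on the colon. A minimal sketch, assuming every entry on the page is rendered as "Label: value" (which is how Humidity, Dew Point and the rest appear in the source):
# Turn each "Label: value" quick fact into a dict entry
# (assumption: every fact string follows the "Label: value" layout).
facts = {}
for fact in quickFacts:
    if ':' in fact:
        label, value = fact.split(':', 1)
        facts[label.strip()] = value.strip()
print(facts.get('Humidity'), facts.get('Dew Point'), facts.get('Visibility'))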
I'm trying to scrape the stats table from Bulbapedia.
I want to get this table from each page. The table isn't in a specific place on the page, and sometimes there are multiple copies of it. I want my script to look for the table on the page and, if it finds one, return that element tag and ignore the other ones.
Here are some pages with the table in different places:
page 1
page 2
page 3
I just want to select the table element, and then I will extract the data I need.
When working with 'wiki' pages that have no specific ids or classes, what you really want to do is find some specific characteristic that distinguishes the objects of interest from everything else.
In your case, if we analyse all three pages, the stat table always contains an a tag whose href is always /wiki/Statistic.
Therefore, to find this specific table, you have two options:
find each table that contains an a tag with href equal to /wiki/Statistic, or
find the parent table of each link with href equal to /wiki/Statistic.
Here is an example of code:
from bs4 import BeautifulSoup
import requests

pages = [
    'https://bulbapedia.bulbagarden.net/wiki/Charmander_(Pokémon)',
    'https://bulbapedia.bulbagarden.net/wiki/Bulbasaur_(Pokémon)',
    'https://bulbapedia.bulbagarden.net/wiki/Eternatus_(Pokémon)'
]

for page in pages:
    response = requests.get(page)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Option 1: every table that contains a link to /wiki/Statistic
    stat_tables = [table for table in soup.find_all('table')
                   if table.find('a', href='/wiki/Statistic') is not None]
    # OR
    # Option 2: the parent table of each /wiki/Statistic link
    stat_tables = [a.find_parent('table') for a in soup.find_all('a', href='/wiki/Statistic')]

    for table in stat_tables:
        pass  # Parse table
Since you said that you just want to extract the table, I'm leaving the parsing part to you :)
However, if you have any questions, please feel free to ask.
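If it helps as a starting point, here is a minimal parsing sketch. It only dumps every row of each matched table as a list of cell texts; which rows actually hold HP, Attack and so on is left for you to check against the real markup:
for table in stat_tables:
    for row in table.find_all('tr'):
        # collect the visible text of every header/data cell in the row
        cells = [cell.get_text(strip=True) for cell in row.find_all(['th', 'td'])]
        if cells:
            print(cells)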
I use BeautifulSoup for parsing a Google search, but I get an empty list. I want to make a spellchecker by using Google's "Did you mean?".
import requests
from bs4 import BeautifulSoup
import urllib.parse
text = "i an you ate goode maan"
data = urllib.parse.quote_plus(text)
url = 'https://translate.google.com/?source=osdd#view=home&op=translate&sl=auto&tl=en&text='
rq = requests.get(url + data)
soup = BeautifulSoup(rq.content, 'html.parser')
words = soup.select('.tlid-spelling-correction spelling-correction gt-spell-correct-message')
print(words)
The output is just [], but I expected: "i and you are good man" (sorry for such a bad example text).
First, the element you are looking for is loaded using JavaScript. Since BeautifulSoup does not run JS, the target elements never get loaded into the DOM, so the query selector can't find them. Try using Selenium instead of BeautifulSoup.
Second, the CSS selector should be
.tlid-spelling-correction.spelling-correction.gt-spell-correct-message
Notice the . instead of a space in front of every class name. I have verified it with the browser's JS query selector.
The selector you were using, .tlid-spelling-correction spelling-correction gt-spell-correct-message, was looking for an element with class gt-spell-correct-message inside an element with class spelling-correction, which itself was inside another element with class tlid-spelling-correction.
By removing the spaces and putting a dot in front of every class name, the selector looks for a single element that has all three of the above-mentioned classes.
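For completeness, a minimal Selenium sketch using the corrected selector. It assumes a matching chromedriver is installed, and Google changes its class names often, so the selector may need updating by the time you run it:
from urllib.parse import quote_plus
from selenium import webdriver
from selenium.webdriver.common.by import By

text = "i an you ate goode maan"
url = ('https://translate.google.com/?source=osdd#view=home&op=translate'
       '&sl=auto&tl=en&text=' + quote_plus(text))

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
try:
    driver.get(url)
    driver.implicitly_wait(10)  # give the JS-rendered suggestion time to appear
    suggestion = driver.find_element(
        By.CSS_SELECTOR,
        '.tlid-spelling-correction.spelling-correction.gt-spell-correct-message')
    print(suggestion.text)
finally:
    driver.quit()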
This is my first time with Python and web scraping. I have been looking around and am still unable to get what I need done.
Below is a print screen of the elements that I've inspected via Chrome. As you can see, it is from the dropdown 'Apartments'.
My 1st step is to get the list of cities from the drop down.
My 2nd step is then, given that city list, to go to each city's page (...url.../Brantford/ for example).
My 3rd step is then, given the available apartments, to click each apartment to get the price range for each bedroom type.
Currently, I am JUST trying to loop through the cities in the first step, and it's not working.
Could you also point me to a good forum, article or tutorial for a beginner like me to read and learn from? I'd really like to get good at this so that I may give back to society one day.
Thank you!
import requests
from bs4 import BeautifulSoup
url = 'http://www.homestead.ca/apartments-for-rent/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html,'lxml')
dropdown_list = soup.find(".child-pages dropdown-menu a href")
print (dropdown_list.prettify())
Screenshot
You can access the elements by the class and a child "a" node, then read the "href" attribute and prepend the domain name.
import requests
from bs4 import BeautifulSoup

url = 'http://www.homestead.ca/apartments-for-rent/'
response = requests.get(url)
html = response.content
soup = BeautifulSoup(html, 'lxml')

dropdown_list = soup.select(".primary .child-pages a")
links = ['http://www.homestead.ca' + x['href'] for x in dropdown_list]
print(links)

city_names = [x.text for x in dropdown_list]
print(city_names)

result = []
for link in links:
    response = requests.get(link)
    html = response.content
    soup = BeautifulSoup(html, 'lxml')
    ...
    result.append(...)
Explanation:
soup.select(".primary .child-pages a")
Using a CSS selector, I select the "a" nodes that are children of a node with the class "child-pages", which is itself a child of the node with the class "primary". There were two nodes with the class "child-pages", and I filtered the one under the "primary" node.
[x.text for x in dropdown_list]
This is a list comprehension in Python. It means that I take every element of dropdown_list, keep only the text attribute of each, and return the results as a list.
You can then iterate over the links and append the data to a list (here "result").
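As a rough sketch of that loop: the real extraction step depends on the markup of each city page, which I have not inspected, so here the page title stands in as a placeholder for the apartment data you would actually pull out:
result = []
for city, link in zip(city_names, links):
    response = requests.get(link)
    soup = BeautifulSoup(response.content, 'lxml')
    # placeholder: grab the page <title>; replace with the real apartment selectors
    page_title = soup.title.get_text(strip=True) if soup.title else ''
    result.append((city, page_title))
print(result)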
I found this introduction to BeautifulSoup pretty good, though I haven't gone through all of its links: http://programminghistorian.org/lessons/intro-to-beautiful-soup
I would also recommend reading a book, for example: Web Scraping with Python: Collecting Data from the Modern Web
I am slightly confused about how to use BeautifulSoup to navigate the HTML tree.
import requests
from bs4 import BeautifulSoup

url = 'http://examplewebsite.com'
source = requests.get(url)
content = source.content
soup = BeautifulSoup(source.content, "html.parser")

# Now I navigate the soup
for a in soup.findAll('a'):
    print(a.get("href"))
Is there a way to find only particular hrefs by their labels? For example, all the hrefs I want are called by a certain name, e.g. price in an online catalog.
The href links I want are all in a certain location within the webpage, inside a particular container element. Can I access only these links?
How can I scrape the contents within each href link and save them into a file format?
With BeautifulSoup, that's all doable and simple.
(1) Is there a way to find only particular href by the labels? For
example, all the href's I want are called by a certain name, e.g.
price in an online catalog.
Say, all the links you need have price in the text - you can use a text argument:
soup.find_all("a", text="price") # text equals to 'price' exactly
soup.find_all("a", text=lambda text: text and "price" in text) # 'price' is inside the text
Yes, you may use functions and many other kinds of objects to filter elements, for example compiled regular expressions:
import re
soup.find_all("a", text=re.compile(r"^[pP]rice"))
If price is somewhere in the "href" attribute, you can use the following CSS selectors:
soup.select("a[href*=price]") # href contains 'price'
soup.select("a[href^=price]") # href starts with 'price'
soup.select("a[href$=price]") # href ends with 'price'
or, via find_all():
soup.find_all("a", href=lambda href: href and "price" in href)
(2) The href links I want are all in a certain location within the webpage,
inside a particular container element. Can I access only these links?
Sure, locate the appropriate container and call find_all() or other searching methods:
container = soup.find("div", class_="container")
for link in container.select("a[href*=price]"):
    print(link["href"])
Or, you may write your CSS selector so that it searches for links inside a specific element with the desired attribute or attribute values. For example, here we search for a elements having href attributes, located inside a div element with the container class:
soup.select("div.container a[href]")
(3) How can I scrape the contents within each href link and save into
a file format?
If I understand correctly, you need to get the appropriate links, follow them and save the source code of the pages locally into HTML files. There are multiple options to choose from depending on your requirements (for instance, speed may be critical, or it may be a one-time task and you don't care about performance).
If you stay with requests, the code would be of a blocking nature: you extract a link, follow it, save the page source and then proceed to the next one. The main downside is that it would be slow (depending, for starters, on how many links there are). Sample code to get you going:
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import requests

base_url = 'http://examplewebsite.com'
with requests.Session() as session:  # maintaining a web-scraping session
    soup = BeautifulSoup(session.get(base_url).content, "html.parser")
    for link in soup.select("div.container a[href]"):
        full_link = urljoin(base_url, link["href"])
        title = link.get_text(strip=True)
        # response.content is bytes, so open the file in binary mode
        with open(title + ".html", "wb") as f:
            f.write(session.get(full_link).content)
You may look into grequests or Scrapy to solve that part.
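As a rough illustration of the grequests route (a sketch only: it assumes the same div.container a[href] selector applies, and the filename derivation is just one arbitrary choice):
import grequests  # grequests monkey-patches sockets via gevent, so import it early
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

base_url = 'http://examplewebsite.com'
soup = BeautifulSoup(requests.get(base_url).content, "html.parser")
links = [urljoin(base_url, a["href"]) for a in soup.select("div.container a[href]")]

# send up to 10 requests concurrently and save each page as it comes back
responses = grequests.map((grequests.get(link) for link in links), size=10)
for link, response in zip(links, responses):
    if response is not None:  # failed requests come back as None
        filename = link.rstrip("/").rsplit("/", 1)[-1] or "index"
        with open(filename + ".html", "wb") as f:
            f.write(response.content)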
I want to extract two attributes (title and href) from the <a> tags of a Wikipedia page.
I want output like this, e.g. for https://en.wikipedia.org/wiki/Riddley_Walker:
Canterbury Cathedral
/wiki/Canterbury_Cathedral
The code:
import lxml.html
from urllib.request import urlopen

def extractplaces(hlink):
    connection = urlopen(hlink)
    places = {}
    dom = lxml.html.fromstring(connection.read())
    for name in dom.xpath('//a/@title'):  # selects the title attribute values of all a tags (links)
        print(name)
In this case I only get the title values.
You should select the a elements that have a title attribute (instead of selecting the title attribute directly), and then use .attrib on each element to get the attributes you need. Example -
for name in dom.xpath('//a[@title]'):
    print('title :', name.attrib['title'])
    print('href :', name.attrib['href'])
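Going one step further (my own extension, not part of the original answer): since your expected output only shows internal wiki links, you can restrict the XPath to anchors whose href starts with /wiki/:
# keep only internal wiki links such as /wiki/Canterbury_Cathedral
for name in dom.xpath('//a[@title and starts-with(@href, "/wiki/")]'):
    print(name.attrib['title'])
    print(name.attrib['href'])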