I am trying to get headlines from www.bbc.co.uk/news. The code I have works fine; here it is:
from bs4 import BeautifulSoup, SoupStrainer
import urllib2
import re
opener = urllib2.build_opener()
url = 'http://www.bbc.co.uk/news'
soup = BeautifulSoup(opener.open(url), "lxml")
titleTag = soup.html.head.title
print(titleTag.string)
titles = soup.find_all('span', {'class' : 'title-link__title-text'})
headlines = [t.text for t in titles]
print(headlines)
But I would like to build a dataset from a given date, say 1 April 2016. The headlines keep changing during the day, and the BBC does not keep a history.
So I thought to get it from web archive. For example, I would like to get headlines from this url (http://web.archive.org/web/20160203074646/http://www.bbc.co.uk/news) for the timestamp 20160203074646.
When I paste the url in my code, the output contains the headlines.
EDIT
But how do I automate this process for all the timestamps?
To see all snapshots for a given URL, replace the timestamp with an asterisk:
http://web.archive.org/web/*/http://www.bbc.co.uk
then screen scrape that.
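As an alternative to scraping that listing page, the Wayback Machine's CDX API can enumerate snapshots programmatically. Here is a sketch with the network call commented out so it runs offline; the sample response mirrors the shape of a CDX `output=json` payload:

```python
def snapshot_url(timestamp, original_url):
    # Build a Wayback Machine URL for a 14-digit timestamp.
    return 'http://web.archive.org/web/%s/%s' % (timestamp, original_url)

def timestamps_from_cdx(cdx_rows):
    # The first row of an 'output=json' CDX response is a header row.
    header, rows = cdx_rows[0], cdx_rows[1:]
    idx = header.index('timestamp')
    return [row[idx] for row in rows]

# A live call would look like (uncomment to use):
# import requests
# cdx = requests.get('http://web.archive.org/cdx/search/cdx',
#                    params={'url': 'bbc.co.uk/news', 'output': 'json'}).json()

# Offline sample shaped like a CDX response:
cdx = [['urlkey', 'timestamp', 'original', 'mimetype', 'statuscode', 'digest', 'length'],
       ['uk,co,bbc)/news', '20160203074646', 'http://www.bbc.co.uk/news',
        'text/html', '200', 'ABCDEF', '12345']]

urls = [snapshot_url(ts, 'http://www.bbc.co.uk/news')
        for ts in timestamps_from_cdx(cdx)]
print(urls)
```

Each resulting URL can then be fed into the BeautifulSoup code above.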
A few things to consider:
The Wayback API will give you the nearest single snapshot to a given timestamp. It sounds like you want all available snapshots, which is why I suggested screen scraping.
The BBC might change headlines faster than the Wayback Machine can snapshot them.
The BBC provides RSS feeds which can be parsed more reliably. There is a listing under "Choose a Feed".
EDIT: have a look at the feedparser docs
import feedparser
d = feedparser.parse('http://feeds.bbci.co.uk/news/rss.xml?edition=uk')
d.entries[0]
Output
{'guidislink': False,
'href': u'',
'id': u'http://www.bbc.co.uk/news/world-europe-37003819',
'link': u'http://www.bbc.co.uk/news/world-europe-37003819',
'links': [{'href': u'http://www.bbc.co.uk/news/world-europe-37003819',
'rel': u'alternate',
'type': u'text/html'}],
'media_thumbnail': [{'height': u'432',
'url': u'http://c.files.bbci.co.uk/12A34/production/_90704367_mediaitem90704366.jpg',
'width': u'768'}],
'published': u'Sun, 07 Aug 2016 21:24:36 GMT',
'published_parsed': time.struct_time(tm_year=2016, tm_mon=8, tm_mday=7, tm_hour=21, tm_min=24, tm_sec=36, tm_wday=6, tm_yday=220, tm_isdst=0),
'summary': u"Turkey's President Erdogan tells a huge rally in Istanbul that he would approve the return of the death penalty if it was backed by parliament and the public.",
'summary_detail': {'base': u'http://feeds.bbci.co.uk/news/rss.xml?edition=uk',
'language': None,
'type': u'text/html',
'value': u"Turkey's President Erdogan tells a huge rally in Istanbul that he would approve the return of the death penalty if it was backed by parliament and the public."},
'title': u'Turkey death penalty: Erdogan backs return at Istanbul rally',
'title_detail': {'base': u'http://feeds.bbci.co.uk/news/rss.xml?edition=uk',
'language': None,
'type': u'text/plain',
'value': u'Turkey death penalty: Erdogan backs return at Istanbul rally'}}
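To build a dataset row per headline, each entry can be flattened to its title plus a parsed timestamp. A sketch using a stub entry shaped like feedparser's output (with a live feed you would iterate `d.entries` directly):

```python
from email.utils import parsedate_to_datetime

# Stub shaped like feedparser's d.entries; not live data.
entries = [
    {'title': 'Turkey death penalty: Erdogan backs return at Istanbul rally',
     'published': 'Sun, 07 Aug 2016 21:24:36 GMT'},
]

# RSS dates are RFC 2822 strings; parsedate_to_datetime handles them.
rows = [{'title': e['title'],
         'published': parsedate_to_datetime(e['published']).isoformat()}
        for e in entries]
print(rows)
```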
Related
I am trying to web scrape this link. As an example, I just want to scrape the first page. I would like to collect the titles and authors for each of the 10 links on the first page.
To gather titles and authors, I wrote the following code:
from bs4 import BeautifulSoup
import requests
import numpy as np
url = 'https://www.bis.org/cbspeeches/index.htm?m=1123'
r = BeautifulSoup(requests.get(url).content, features = "lxml")
r.select('#cbspeeches_list a') # '#cbspeeches_list a' got via SelectorGadget
However, I get an empty list. What am I doing wrong?
Thanks!
The data is loaded from an external source via an API POST request; you just have to use the API URL.
from bs4 import BeautifulSoup
import requests
payload = 'from=&till=&objid=cbspeeches&page=&paging_length=10&sort_list=date_desc&theme=cbspeeches&ml=false&mlurl=&emptylisttext='
url= 'https://www.bis.org/doclist/cbspeeches.htm'
headers= {
"content-type": "application/x-www-form-urlencoded",
"X-Requested-With": "XMLHttpRequest"
}
req = requests.post(url, headers=headers, data=payload)
print(req)
soup = BeautifulSoup(req.content, "lxml")
data = []
for card in soup.select('.documentList tbody tr'):
    title = card.select_one('.title a').get_text()
    author = card.select_one('.authorlnk.dashed').get_text().strip()
    data.append({
        'title': title,
        'author': author
    })
print(data)
Output
[{'title': 'Pablo Hernández de Cos: Closing ceremony of the academic year 2021-2022', 'author': '\nPablo Hernández de Cos'}, {'title': 'Klaas Knot: Keti Koti 2022 marks turning point for the Netherlands Bank ', 'author': '\nKlaas Knot'}, {'title': 'Luis de Guindos: Challenges for monetary policy', 'author': '\nLuis de Guindos'}, {'title': 'Fabio Panetta: Europe as a common
shield - protecting the euro area economy from global shocks', 'author': '\nFabio Panetta'},
{'title': 'Victoria Cleland: Rowing in unison to enhance cross-border payments', 'author': '\nVictoria Cleland'}, {'title': 'Yaron Amir: A look at the future world of payments - trends, the market, and regulation', 'author': '\nYaron Amir'}, {'title': 'Ásgeir Jónsson: Speech – 61st Annual Meeting of the Central Bank of Iceland', 'author': '\nÁsgeir Jónsson'}, {'title': 'Lesetja Kganyago: Project Khokha 2 report launch', 'author': '\nLesetja Kganyago'}, {'title': 'Huw Pill: What did the monetarists ever do for us?', 'author': '\nHuw Pill'}, {'title': 'Shaktikanta Das: Inaugural address - Statistics Day Conference ', 'author': '\nShaktikanta Das'}]
Try this:
data = {
'from': '',
'till': '',
'objid': 'cbspeeches',
'page': '',
'paging_length': '25',
'sort_list': 'date_desc',
'theme': 'cbspeeches',
'ml': 'false',
'mlurl': '',
'emptylisttext': ''
}
response = requests.post('https://www.bis.org/doclist/cbspeeches.htm', data=data)
soup = BeautifulSoup(response.content)
for elem in soup.find_all("tr"):
    # the title
    print(elem.find("a").text)
    # the author
    print(elem.find("a", class_="authorlnk dashed").text)
    print()
Prints out:
Pablo Hernández de Cos: Closing ceremony of the academic year 2021-2022
Pablo Hernández de Cos
Klaas Knot: Keti Koti 2022 marks turning point for the Netherlands Bank
Klaas Knot
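To collect more than the first batch of results, the same form payload can be rebuilt per page. Whether the endpoint accepts explicit numeric `page` values this way is an assumption inferred from the payload fields above; the network call is left commented out so this sketch runs offline:

```python
# Helper that rebuilds the same form payload with a different 'page' value.
def bis_payload(page, paging_length=10):
    return {
        'from': '', 'till': '', 'objid': 'cbspeeches',
        'page': str(page), 'paging_length': str(paging_length),
        'sort_list': 'date_desc', 'theme': 'cbspeeches',
        'ml': 'false', 'mlurl': '', 'emptylisttext': ''
    }

# A live run would repeat the POST per page:
# import requests
# for page in range(1, 4):
#     r = requests.post('https://www.bis.org/doclist/cbspeeches.htm',
#                       data=bis_payload(page))
#     ...parse r.content with BeautifulSoup as above...

print(bis_payload(2))
```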
Whenever I try to extract the data, it returns "None", and I'm not sure whether it's my code (I followed the rules of using bs4) or whether the website is just different to scrape.
My code:
import requests
import bs4 as bs
url = 'https://www.zomato.com/jakarta/pondok-indah-restaurants'
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = req.text
soup = bs.BeautifulSoup(html, "html.parser")
listings = soup.find('div', class_='sc-gAmQfK fKxEbD')
rest_name = listings.find('h4', class_='sc-1hp8d8a-0 sc-eTyWNx gKsZcT').text
##Output: AttributeError: 'NoneType' object has no attribute 'find'
print(listings)
##returns None
Here is the inspected tag of the website which i try to get the h4 class showing the restaurant's name:
inspected element
What happens?
Classes are generated dynamically and may differ from what you inspected via the developer tools, so you won't find what you are looking for.
How to fix?
It would be a better approach to select your targets via tag or id if available, because these are more stable than CSS classes.
listings = soup.select('a:has(h4)')
Example
Iterate the listings and scrape several pieces of information:
import requests
import bs4 as bs
url = 'https://www.zomato.com/jakarta/pondok-indah-restaurants'
req = requests.get(url, headers={'User-Agent': 'Mozilla/5.0'})
html = req.text
soup = bs.BeautifulSoup(html, "html.parser")
data = []
for item in soup.select('a:has(h4)'):
    data.append({
        'title': item.h4.text,
        'url': item['href'],
        'etc': '...'
    })
print(data)
Output
[{'title': 'Radio Dalam Diner', 'url': '/jakarta/radio-dalam-diner-pondok-indah/info', 'etc': '...'}, {'title': 'Aneka Bubur 786', 'url': '/jakarta/aneka-bubur-786-pondok-indah/info', 'etc': '...'}, {'title': "McDonald's", 'url': '/jakarta/mcdonalds-pondok-indah/info', 'etc': '...'}, {'title': 'KOPIKOBOY', 'url': '/jakarta/kopikoboy-pondok-indah/info', 'etc': '...'}, {'title': 'Kopitelu', 'url': '/jakarta/kopitelu-pondok-indah/info', 'etc': '...'}, {'title': 'KFC', 'url': '/jakarta/kfc-pondok-indah/info', 'etc': '...'}, {'title': 'HokBen Delivery', 'url': '/jakarta/hokben-delivery-pondok-indah/info', 'etc': '...'}, {'title': 'PHD', 'url': '/jakarta/phd-pondok-indah/info', 'etc': '...'}, {'title': 'Casa De Jose', 'url': '/jakarta/casa-de-jose-pondok-indah/info', 'etc': '...'}]
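Note that the 'url' values in that output are site-relative; `urljoin` from the standard library resolves them against the domain before they can be requested directly:

```python
from urllib.parse import urljoin

base = 'https://www.zomato.com'
relative = '/jakarta/radio-dalam-diner-pondok-indah/info'

# urljoin handles absolute paths, trailing slashes, etc. correctly.
full = urljoin(base, relative)
print(full)  # https://www.zomato.com/jakarta/radio-dalam-diner-pondok-indah/info
```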
Lines of code are:
import requests
from bs4 import BeautifulSoup
hd = {'Accept-Language': 'en,en-US'}
res = requests.get('https://www.udemy.com/courses/search/?q=python%20web%20scraping&src=sac&kw=python%20web%20sc', headers = hd)
soup = BeautifulSoup(res.content, 'lxml')
courses = soup.find('div', class_='popper--popper--19faV popper--popper-hover--4YJ5J')
print(courses)
I am trying to get the course name from the div with class 'popper--popper--19faV popper--popper-hover--4YJ5J', but I get None.
Any suggestions on how to get the course name and, later, the current price? Thank you.
You are dealing with dynamic content so you may try selenium.
Example
from selenium import webdriver
from bs4 import BeautifulSoup
from time import sleep
driver = webdriver.Chrome(executable_path=r'C:\Program Files\ChromeDriver\chromedriver.exe')
url = "https://www.udemy.com/courses/search/?q=python%20web%20scraping&src=sac&kw=python%20web%20sc"
driver.get(url)
sleep(5)
soup = BeautifulSoup(driver.page_source, "lxml")
data = []
for course in soup.select('div.course-list--container--3zXPS > div.popper--popper--19faV.popper--popper-hover--4YJ5J'):
    name = course.select_one('div.udlite-focus-visible-target.udlite-heading-md.course-card--course-title--2f7tE').get_text(strip=True)
    price = course.select_one('div.price-text--price-part--Tu6MH.course-card--discount-price--3TaBk.udlite-heading-md span > span').get_text(strip=True).replace('\xa0€','')
    data.append({'name': name, 'price': price})
driver.close()
data
Output
[{'name': 'Reguläre Ausdrücke (Regular Expressions) in Python',
'price': '14,99'},
{'name': 'Python Bootcamp: Vom Anfänger zum Profi, inkl. Data Science',
'price': '13,99'},
{'name': 'WebScraping - Automatisiert Daten sammeln!', 'price': '13,99'},
{'name': 'Fortgeschrittene Python Programmierung', 'price': '13,99'},
{'name': 'Python Bootcamp: Der Einstiegskurs', 'price': '14,99'},
{'name': 'Python - Das Python Grundlagen Bootcamp - Von 0 auf 100!',
'price': '13,99'},
{'name': 'Python A-Z - Lerne es schnell & einfach, inkl. Data Science!',
'price': '13,99'},
{'name': 'Data Science & Maschinelles Lernen in Python - am Beispiel',
'price': '13,99'},
{'name': 'Schnelleinstieg in die Python Programmierung für Anfänger',
'price': '13,99'},
{'name': 'Visualisiere Daten mit Python - auch für Anfänger!',
'price': '18,99'},
{'name': 'Python-Entwicklung für Einsteiger', 'price': '13,99'},
{'name': 'Python 3 - Einführung in die Programmierung', 'price': '13,99'},
{'name': 'Dash - Interaktive Python Visualisierungen für Data Science',
'price': '21,99'},
{'name': 'Python für Data Science, Machine Learning & Visualization',
'price': '13,99'},
{'name': 'Python in 4 Stunden von Null zum Python Programmierer',
'price': '13,99'},
{'name': 'Python 3 programmieren - Einsteigerkurs', 'price': '13,99'},
{'name': 'Python für Einsteiger, inkl. Data Science', 'price': '13,99'},
{'name': 'Lambda Funktionen & List Comprehensions in Python',
'price': '13,99'},
{'name': 'Deep Learning, Neuronale Netze und TensorFlow 2 in Python',
'price': '18,99'},
{'name': 'Python Crashkurs für (Quer) Einsteiger', 'price': '13,99'}]
Look at the examples from Beautiful Soup 4 Docs:
Remember that a single tag can have multiple values for its “class” attribute. When you search for a tag that matches a certain CSS class, you’re matching against any of its CSS classes
...
You can also search for the exact string value of the class attribute
When you search for a class by passing a string, variants of that string will not match. Your code passes multiple classes, and BS4 will only match an element whose class attribute is exactly that string. Looking at the HTML of the page, the element has an additional class, header--gap-button--3bIww, and its presence is why BS4 does not find the element.
As another commenter said, it is also a nav element, not a div.
# Fails because multiple classes are listed, but the list is not exhaustive (or in order)
courses = soup.find("nav", class_="popper--popper--19faV popper--popper-hover--4YJ5J")
# Works because all classes are listed in order
courses = soup.find("nav", class_="header--gap-button--3bIww popper--popper--19faV popper--popper-hover--4YJ5J",)
# Works because only one class is listed and matched against
courses = soup.find("nav", class_="popper--popper--19faV")
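The exact-string behaviour is easy to verify offline with a minimal snippet (hypothetical classes a, b, c, not Udemy's real markup):

```python
from bs4 import BeautifulSoup

# An element carrying three classes.
soup = BeautifulSoup('<nav class="a b c">menu</nav>', 'html.parser')

single = soup.find('nav', class_='a')     # matches: any one of its classes
partial = soup.find('nav', class_='a b')  # None: not the exact attribute string
exact = soup.find('nav', class_='a b c')  # matches: the exact string value
print(single is not None, partial is None, exact is not None)
```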
I'm currently working on a project that involves web scraping a real estate website (for educational purposes). I'm taking data from home listings like address, price, bedrooms, etc.
After building and testing along the way with the print function (it worked successfully!), I'm now building a dictionary for each data point in the listing. I'm storing that dictionary in a list in order to eventually use Pandas to create a table and send to a CSV.
Here is my problem: my list contains only empty dictionaries, with no error. Please note, I've already scraped the data successfully and have seen it when using the print function. Now nothing shows up after adding each data point to a dictionary and putting it in a list. Here is my code:
import requests
from bs4 import BeautifulSoup
r=requests.get("https://www.century21.com/real-estate/colorado-springs-co/LCCOCOLORADOSPRINGS/")
c=r.content
soup=BeautifulSoup(c,"html.parser")
all=soup.find_all("div", {"class":"infinite-item"})
all[0].find("a",{"class":"listing-price"}).text.replace("\n","").replace(" ","")
l = []
for item in all:
    d = {}
    try:
        d["Price"] = item.find("a", {"class": "listing-price"}.text.replace("\n", "").replace(" ", ""))
        d["Address"] = item.find("div", {"class": "property-address"}).text.replace("\n", "").replace(" ", "")
        d["City"] = item.find_all("div", {"class": "property-city"})[0].text.replace("\n", "").replace(" ", "")
        try:
            d["Beds"] = item.find("div", {"class": "property-beds"}).find("strong").text
        except:
            d["Beds"] = None
        try:
            d["Baths"] = item.find("div", {"class": "property-baths"}).find("strong").text
        except:
            d["Baths"] = None
        try:
            d["Area"] = item.find("div", {"class": "property-sqft"}).find("strong").text
        except:
            d["Area"] = None
    except:
        pass
    l.append(d)
When I call l (the list that contains my dictionary) - this is what I get:
[{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{},
{}]
I'm using Python 3.8.2 with Beautiful Soup 4. Any ideas or help with this would be greatly appreciated. Thanks!
This does what you want much more concisely and is more pythonic (using nested list comprehension):
import requests
from bs4 import BeautifulSoup
r = requests.get("https://www.century21.com/real-estate/colorado-springs-co/LCCOCOLORADOSPRINGS/")
c = r.content
soup = BeautifulSoup(c, "html.parser")
css_classes = [
"listing-price",
"property-address",
"property-city",
"property-beds",
"property-baths",
"property-sqft",
]
pl = [{css_class.split('-')[1]: item.find(class_=css_class).text.strip() # assumes each class is present in every card
for css_class in css_classes} # find each class in the class list
for item in soup.find_all('div', class_='property-card-primary-info')] # find each property card div
print(pl)
Output:
[{'address': '512 Silver Oak Grove',
'baths': '6 baths',
'beds': '4 beds',
'city': 'Colorado Springs CO 80906',
'price': '$1,595,000',
'sqft': '6,958 sq. ft'},
{'address': '8910 Edgefield Drive',
'baths': '5 baths',
'beds': '5 beds',
'city': 'Colorado Springs CO 80920',
'price': '$499,900',
'sqft': '4,557 sq. ft'},
{'address': '135 Mayhurst Avenue',
'baths': '3 baths',
'beds': '3 beds',
'city': 'Colorado Springs CO 80906',
'price': '$420,000',
'sqft': '1,889 sq. ft'},
{'address': '7925 Bard Court',
'baths': '4 baths',
'beds': '5 beds',
'city': 'Colorado Springs CO 80920',
'price': '$405,000',
'sqft': '3,077 sq. ft'},
{'address': '7641 N Sioux Circle',
'baths': '3 baths',
'beds': '4 beds',
'city': 'Colorado Springs CO 80915',
'price': '$389,900',
'sqft': '3,384 sq. ft'},
...
]
You should use a function to do the repetitive job; it would make your code clearer.
I've come up with this code, which works:
import requests
from bs4 import BeautifulSoup
def find_div_and_get_value(soup, html_balise, attributes):
    return soup.find(html_balise, attrs=attributes).text.replace("\n", "").strip()

def find_div_and_get_value2(soup, html_balise, attributes):
    return soup.find(html_balise, attrs=attributes).find('strong').text.replace("\n", "").strip()
r=requests.get("https://www.century21.com/real-estate/colorado-springs-co/LCCOCOLORADOSPRINGS/")
c=r.content
soup = BeautifulSoup(c,"html.parser")
houses = soup.findAll("div", {"class":"infinite-item"})
l=[]
for house in houses:
    try:
        d = {}
        d["Price"] = find_div_and_get_value(house, 'a', {"class": "listing-price"})
        d["Address"] = find_div_and_get_value(house, 'div', {"class": "property-address"})
        d["City"] = find_div_and_get_value(house, 'div', {"class": "property-city"})
        d["Beds"] = find_div_and_get_value2(house, 'div', {"class": "property-beds"})
        d["Baths"] = find_div_and_get_value2(house, 'div', {"class": "property-baths"})
        d["Area"] = find_div_and_get_value2(house, 'div', {"class": "property-sqft"})
        l.append(d)
    except:
        break
for house in l:
    print(house)
I want to extract the full address from the webpage and I'm using BeautifulSoup and JSON.
Here's my code:
import bs4
import json
from bs4 import BeautifulSoup
import requests
url = 'xxxxxxxxxxxxxxxxx'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')
for i in soup.find_all('div', attrs={'data-integration-name': 'redux-container'}):
    info = json.loads(i.get('data-payload'))
I printed 'info' out:
{'storeName': None, 'props': {'locations': [{'dirty': False, 'updated_at': '2016-05-05T07:57:19.282Z', 'country_code': 'US', 'company_id': 106906, 'longitude': -74.0001954, 'address': '5 Crosby St 3rd Floor', 'state': 'New York', 'full_address': '5 Crosby St 3rd Floor, New York, 10013, New York, USA', 'country': 'United States', 'id': 17305, 'to_params': 'new-york-us', 'latitude': 40.719753, 'region': '', 'city': 'New York', 'description': '', 'created_at': '2015-01-19T01:32:16.317Z', 'zip_code': '10013', 'hq': True}]}, 'name': 'LocationsMapList'}
What I want is the "full_address" under "location" so my code was:
info = json.loads(i.get('data-payload'))
for i in info['props']['locations']:
    print(i['full_address'])
But I got this error:
----> 5 for i in info['props']['locations']:
KeyError: 'locations'
I want to print the full address out, which is '5 Crosby St 3rd Floor, New York, 10013, New York, USA'.
Thanks a lot!
The data you are parsing seems to be inconsistent: the keys are not present in all objects.
If you still want to loop over them, you need a try/except statement to catch the exception, or the dictionary method get to set a fallback when the key you're looking for may be missing.
info = json.loads(i.get('data-payload'))
for item in info['props'].get('locations', []):
    print(item.get('full_address', 'no address'))
get('locations', []): returns an empty list if the key locations doesn't exist, so the loop doesn't run any iterations.
get('full_address', 'no address'): returns "no address" if there is no such key.
EDIT :
The data is inconsistent (never trust data). Some JSON objects have a props key with a null/None value. The following fix should handle that:
info = json.loads(i.get('data-payload'))
if info.get('props'):
    for item in info['props'].get('locations', []):
        print(item.get('full_address', 'no address'))
Your first object is fine, but it's clear that your second object has no locations key anywhere, nor full_address.
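Putting both guards together, here is an offline sketch with two sample payloads shaped like the scraped data, one complete and one with a null props, showing that only the valid address survives:

```python
# Sample payloads shaped like the parsed 'data-payload' objects; not live data.
payloads = [
    {'props': {'locations': [
        {'full_address': '5 Crosby St 3rd Floor, New York, 10013, New York, USA'}]}},
    {'props': None},  # inconsistent object: props is null
]

found = []
for info in payloads:
    if info.get('props'):  # skip null/missing props
        for item in info['props'].get('locations', []):  # empty list fallback
            found.append(item.get('full_address', 'no address'))

print(found)
```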