Beautiful soup and extracting values - python

I would be grateful for some guidance on how to grab the date of birth, "16 June 1723", shown below using BeautifulSoup. With my code I have managed to grab the values you see under Result, but all I need is the value 16 June 1723. Any advice?
My code:
birth = soup.find("table",{"class":"infobox"})
test = birth.find(text='Born')
next_cell = test.find_parent('th').find_next_sibling('td').get_text()
print next_cell
Result:
16 June 1723 NS (5 June 1723 OS)Kirkcaldy, Scotland,Great Britain

Replace the last print statement with this, which keeps only the first three whitespace-separated tokens:
print ' '.join(str(next_cell).split()[:3])
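If the date is not guaranteed to be the first three tokens of the cell, a regular expression is one alternative to slicing. This is a minimal sketch rather than part of the answer above: the helper name first_date is hypothetical, and it assumes the date is written as day, capitalized month name, year, as in the result shown in the question.

```python
import re

def first_date(text):
    """Return the first 'day Month year' substring in text, or None."""
    match = re.search(r"\b\d{1,2} [A-Z][a-z]+ \d{3,4}\b", text)
    return match.group(0) if match else None

# The cell text reported in the question.
cell_text = "16 June 1723 NS (5 June 1723 OS)Kirkcaldy, Scotland,Great Britain"
print(first_date(cell_text))
```

Because re.search stops at the first match, the Old Style date in parentheses is ignored.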

Scraping all entries of lazyloading page using python

See this page with ECB press releases. These go back to 1997, so it would be nice to automate getting all the links going back in time.
I found the tag that harbours the links ('//*[@id="lazyload-container"]'), but it only gets the most recent links.
How can I get the rest?
from bs4 import BeautifulSoup
from selenium import webdriver
driver = webdriver.Firefox(executable_path=r'/usr/local/bin/geckodriver')
driver.get(url)
element = driver.find_element_by_xpath('//*[@id="lazyload-container"]')
element = element.get_attribute('innerHTML')
The data is loaded via JavaScript from another URL. This example shows how to load the releases for different years:
import requests
from bs4 import BeautifulSoup
url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"
for year in range(1997, 2023):
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    for a in soup.select(".title a")[::-1]:
        print(a.find_previous(class_="date").text, a.text)
Prints:
25 April 1997 "EUR" - the new currency code for the euro
1 July 1997 Change of presidency of the European Monetary Institute
2 July 1997 The security features of the euro banknotes
2 July 1997 The EMI's mandate with respect to banknotes
...
17 February 2022 Financial statements of the ECB for 2021
21 February 2022 Survey on credit terms and conditions in euro-denominated securities financing and over-the-counter derivatives markets (SESFOD) - December 2021
21 February 2022 Results of the December 2021 survey on credit terms and conditions in euro-denominated securities financing and over-the-counter derivatives markets (SESFOD)
EDIT: To print links:
import requests
from bs4 import BeautifulSoup
url = "https://www.ecb.europa.eu/press/pr/date/{}/html/index_include.en.html"
for year in range(1997, 2023):
    soup = BeautifulSoup(requests.get(url.format(year)).content, "html.parser")
    for a in soup.select(".title a")[::-1]:
        print(
            a.find_previous(class_="date").text,
            a.text,
            "https://www.ecb.europa.eu" + a["href"],
        )
Prints:
...
15 December 1999 Monetary policy decisions https://www.ecb.europa.eu/press/pr/date/1999/html/pr991215.en.html
20 December 1999 Visit by the Finnish Prime Minister https://www.ecb.europa.eu/press/pr/date/1999/html/pr991220.en.html
...
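The URL handling in the answer can be isolated from the network calls. The sketch below only builds the yearly index URLs and resolves an href against the site root with urllib.parse.urljoin in place of string concatenation. The helper names index_urls and absolute_link are hypothetical, and the sketch assumes the hrefs on the page are root-relative, as the printed links above suggest.

```python
from urllib.parse import urljoin

BASE = "https://www.ecb.europa.eu"
INDEX = BASE + "/press/pr/date/{}/html/index_include.en.html"

def index_urls(first_year, last_year):
    """Yearly index URLs, oldest first (both years inclusive)."""
    return [INDEX.format(year) for year in range(first_year, last_year + 1)]

def absolute_link(href):
    """Resolve an href taken from the page against the site root."""
    return urljoin(BASE, href)

print(index_urls(1997, 1998))
print(absolute_link("/press/pr/date/1999/html/pr991215.en.html"))
```

urljoin also leaves an href that is already absolute unchanged, which plain concatenation would not.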

Get year from unknown date format using python

I am querying a server for specific data and need to extract the year from the date field it returns. However, the format of the date field varies, for example:
2009
2009-10-8
2009-10
2017-10-22
2017-10
The obvious approach would be to split the date into an array and take the max (but there is a problem):
year = max(d.split('-'))
For some reason this gives false positives, as 22 seems to be the max versus 2017. Also, if future calls to the server return the date as "2019/10/20", this will cause issues as well.
The problem is that, while 2017 > 22, '2017' < '22' because it's a string comparison. You could do this to resolve that:
year = max(map(int, d.split('-')))
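The difference between the two comparisons can be checked directly. The sketch below also splits on either '-' or '/' to cover the "2019/10/20" case raised in the question. The helper name year_by_max is hypothetical, and the approach assumes the year is always the largest numeric component.

```python
import re

# Strings compare character by character, so '22' sorts after '2017'.
print(max("2017-10-22".split("-")))            # string comparison
print(max(map(int, "2017-10-22".split("-"))))  # numeric comparison

def year_by_max(d):
    """Largest numeric component of a date split on '-' or '/'."""
    return max(int(part) for part in re.split(r"[-/]", d))

print(year_by_max("2019/10/20"))
```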
But instead, if you don't mind being frowned upon by the Long Now Foundation, consider using a regular expression to extract any 4-digit number:
import re
match = re.search(r'\b\d{4}\b', d)
if match:
    year = int(match.group(0))
I would use the python-dateutil library to easily extract the year from a date string:
from dateutil.parser import parse
dates = ['2009', '2009-10-8', '2009-10']
for date in dates:
    print(parse(date).year)
Output:
2009
2009
2009
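If installing python-dateutil is not an option, a similar result can be sketched with the standard library by trying a fixed list of formats in turn. This is a different technique from the answer above, and the list of formats is an assumption based on the samples in the question plus the "2019/10/20" variant. The helper name year_of is hypothetical.

```python
from datetime import datetime

# Candidate layouts, an assumption based on the samples in the question.
FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y-%m", "%Y")

def year_of(date_string):
    """Return the year of date_string, trying each known format in turn."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(date_string, fmt).year
        except ValueError:
            continue
    raise ValueError("unrecognised date format: %r" % date_string)

for sample in ["2009", "2009-10-8", "2009-10", "2017-10-22", "2019/10/20"]:
    print(sample, year_of(sample))
```

Unlike the regular-expression approach, this rejects strings that merely contain four digits but are not valid dates in one of the listed formats.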

BeautifulSoup Python Extracting Tag Title For Specific Tags With Attribute

I'm working on a scraper using BeautifulSoup to pull concert information for certain artists on Songkick. The URL I'm working with is https://www.songkick.com/metro-areas/17835-us-los-angeles-la/february-2020?page=1. I've been able to extract all artist, venue, city, and state info; the only thing I'm having trouble with is extracting the dates of the concerts.
Looking at the HTML, I see that the show dates appear as li title="Saturday 01 February 2020" values, for example on the children of ul class="event-listings". I tried extracting the time datetime values nested under those li elements, but my output included the entire markup of each time element instead of just the datetime. I'm looking to extract either the li titles or the time datetime values. These li's don't have a class either.
Here is some of my code
import requests
from bs4 import BeautifulSoup as bs4
pages=[]
artists=[]
venues=[]
dates=[]
cities=[]
states=[]
pages_to_scrape=1
for i in range(1, pages_to_scrape+1):
    url = 'https://www.songkick.com/metro-areas/17835-us-los-angeles-la/february-2020?page={}'.format(i)
    pages.append(url)
for item in pages:
    page = requests.get(item)
    soup = bs4(page.text, 'html.parser')
    for m in soup.findAll('li', title=True):
        date = m.find('time')
        print(date)
Output:
<time datetime="2020-02-01T20:00:00-0800"></time>
<time datetime="2020-02-01T20:00:00-0800"></time>
<time datetime="2020-02-01T19:00:00-0800"></time>
<time datetime="2020-02-01T19:00:00-0800"></time>
<time datetime="2020-02-01T21:00:00-0800"></time>
etc...
Looking for output like this:
2020-02-01
2020-02-01
2020-02-01
etc...
Or, if I can somehow grab the title values of the li's, output like this:
Saturday 01 February 2020
Saturday 01 February 2020
Saturday 01 February 2020
Saturday 01 February 2020
etc...
I'm curious whether I can split at the " for the time datetime, but since it's not text I don't think that's possible. Also, I don't want to grab the first li class = "with-date", as that is just the date headline for the page, which is why I'm not just grabbing all li's.
Try m.find('time')['datetime'] instead of m.find('time')
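Once the datetime attribute is available as a plain string, both output styles requested in the question can be derived from it. The following is a minimal sketch using only the standard library; the helper name split_datetime_attr is hypothetical, and it assumes every value has the same layout as the ones printed in the question.

```python
from datetime import datetime

def split_datetime_attr(value):
    """Return (ISO date, long date) for a value like '2020-02-01T20:00:00-0800'."""
    moment = datetime.strptime(value, "%Y-%m-%dT%H:%M:%S%z")
    return moment.strftime("%Y-%m-%d"), moment.strftime("%A %d %B %Y")

print(split_datetime_attr("2020-02-01T20:00:00-0800"))
```

The long form relies on the default locale for English day and month names. The li title itself should be readable the same way as the time attribute, via m['title'], since the loop already selects only li elements that have a title.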
Here's a way to achieve this:
import requests
from bs4 import BeautifulSoup
page = requests.get("https://www.songkick.com/metro-areas/17835-us-los-angeles-la/february-2020?page=1")
soup = BeautifulSoup(page.content, "html.parser")
tags = soup.find_all("time")
# keep only the date part (before the "T") of each datetime attribute
print([t["datetime"].split("T")[0] for t in tags])
Notes:
I'm quite sure that crawling Songkick in this way violates their terms and conditions.
You might consider using their API, which works well: https://www.songkick.com/developer

Using regex separators with read_csv() in python?

I have a lot of csv files formatted as such:
date1::tweet1::location1::language1
date2::tweet2::location2::language2
date3::tweet3::location3::language3
and so on. Some files contain up to 200 000 tweets. I want to extract 4 fields and put them in a pandas dataframe, as well as count the number of tweets. Here's the code I'm using for now:
try:
    data = pd.read_csv(tweets_data_path, sep="::", header = None, engine='python')
    data.columns = ["timestamp", "tweet", "location", "lang"]
    print 'Number of tweets: ' + str(len(data))
except BaseException, e :
    print 'Error: ',str(e)
I get the following error thrown at me
Error: expected 4 fields in line 4581, saw 5
I tried setting error_bad_lines = False, manually deleting the lines that make the program fail, and setting nrows to a lower number, and I still get those "expected fields" errors for random lines. Say I delete the bottom half of the file: I get the same error, but for line 1787. That doesn't make sense to me, as that line was processed correctly before. Visually inspecting the csv files doesn't reveal any abnormal patterns that suddenly appear in the offending line either.
The date fields and tweets contain colons, URLs and so on, so perhaps regex would make sense?
Can someone help me figure out what I'm doing wrong? Many thanks in advance!
Sample of the data as requested below:
Fri Apr 22 21:41:03 +0000 2016::RT #TalOfer: Barack Obama: Brexit would put UK back of the queue for trade talks [short url] #EuRef #StrongerIn::United Kingdom::en
Fri Apr 22 21:41:07 +0000 2016::RT #JamieRoss7: It must be awful to strongly believe in Brexit and be watching your campaigns make an absolute horse's arse of it.::The United Kingdom::en
Fri Apr 22 21:41:07 +0000 2016::Whether or not it rains on June 23rd will have more influence on the vote than Obama's lunch with the Queen and LiGA with George. #brexit.::Dublin, Ireland::en
Fri Apr 22 21:41:08 +0000 2016::FINANCIAL TIMES FRONT PAGE: 'Obama warns Brexit vote would send UK to 'back of trade queue' #skypapers [short url]::Mardan, Pakistan::en
Start with this:
pd.read_csv(tweets_data_path, sep="::", header = None, usecols = [0,1,2,3])
The above should bring in 4 columns, then you can figure out how many lines were dropped, and if the data makes sense.
Use this pattern:
data["lang"].unique()
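Before loading anything into pandas, it can help to list the lines whose field count is not what the parser expects. This is a minimal sketch of such a check, assuming one record per line and "::" as the only separator. The helper name suspicious_lines and the sample lines are made up for illustration.

```python
def suspicious_lines(lines, sep="::", expected_fields=4):
    """Return (line_number, field_count) for lines with an unexpected field count."""
    bad = []
    for number, line in enumerate(lines, start=1):
        count = line.rstrip("\n").count(sep) + 1
        if count != expected_fields:
            bad.append((number, count))
    return bad

# Made-up sample: the second tweet contains the separator itself.
sample = [
    "date1::tweet1::location1::language1\n",
    "date2::tweet with :: inside::location2::language2\n",
]
print(suspicious_lines(sample))
```

An open file object can be passed in place of the list, since iterating over a file yields its lines.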
Since you have a problem with the data and do not know where it is, you need to step back and use Python's csv reader. This should get you started.
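import csv
import pandas as pd
tweetList = []
# csv.reader expects an open file (or another iterable of lines), not a path
with open(tweets_data_path) as f:
    reader = csv.reader(f)
    for row in reader:
        try:
            # row[0] is the text before the first comma on the line
            tweetList.append(row[0].split('::'))
        except BaseException, e:
            print 'Error: ', str(e)
print tweetList
tweetsDf = pd.DataFrame(tweetList)
print tweetsDf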
0 \
0 Fri Apr 22 21:41:03 +0000 2016
1 Fri Apr 22 21:41:07 +0000 2016
2 Fri Apr 22 21:41:07 +0000 2016
3 Fri Apr 22 21:41:08 +0000 2016
1 2 3
0 RT #TalOfer: Barack Obama: Brexit would put UK... United Kingdom en
1 RT #JamieRoss7: It must be awful to strongly b... The United Kingdom en
2 Whether or not it rains on June 23rd will hav... Dublin None
3 FINANCIAL TIMES FRONT PAGE: 'Obama warns Brexi... Mardan None
Have you tried read_table instead? I've got this kind of error when I tried to use read_csv before and I solved the problem by using it. Please refer to this post, this might give you some ideas about how to solve the error. And maybe also try sep=r":{2}" as delimiter.
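One possible cause of "expected 4 fields, saw 5" is a tweet that itself contains "::". Under that assumption, a line can be split from both ends so that any extra separators stay inside the tweet text. This is a sketch, not a confirmed diagnosis of the files in the question; the helper name split_tweet_line is hypothetical, and the sample line is adapted from the data above.

```python
def split_tweet_line(line, sep="::"):
    """Split one line into [timestamp, tweet, location, lang].

    Assumes the timestamp, location and language never contain the
    separator, so any extra separators must belong to the tweet text.
    """
    timestamp, rest = line.rstrip("\n").split(sep, 1)
    tweet, location, lang = rest.rsplit(sep, 2)
    return [timestamp, tweet, location, lang]

# Sample adapted from the question, with a separator added inside the tweet.
line = "Fri Apr 22 21:41:07 +0000 2016::rain :: or shine #brexit::Dublin, Ireland::en\n"
print(split_tweet_line(line))
```

Lines with fewer than four fields raise ValueError, which makes them easy to log and skip. The resulting rows can then be passed to pd.DataFrame with the four column names from the question.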

Parsing Environment Canada Website

I am trying to scrape the weather forecast from "https://weather.gc.ca/city/pages/ab-52_metric_e.html". With the code below I am able to get the table containing the data but I'm stuck. During the day the second row contains Today's forecast and the third row contains tonight's forecast. At the end of the day the second row becomes Tonight's forecast and Today's forecast is dropped. What I want to do is parse through the table to get the forecast for Today, Tonight, and each continuing day even if Today's forecast is missing; something like this:
Today: A mix of sun and cloud. 60 percent chance of showers this afternoon with risk of a thunderstorm. Widespread smoke. High 26. UV index 6 or high.
Tonight: Partly cloudy. Becoming clear this evening. Increasing cloudiness before morning. Widespread smoke. Low 13.
Friday: Mainly cloudy. Widespread smoke. Wind becoming southwest 30 km/h gusting to 50 in the afternoon. High 24.
#using Beautiful Soup 3, Python 2.6
from BeautifulSoup import BeautifulSoup
import urllib
pageFile = urllib.urlopen("https://weather.gc.ca/city/pages/ab-52_metric_e.html")
pageHtml = pageFile.read()
pageFile.close()
soup = BeautifulSoup("".join(pageHtml))
data = soup.find("div", {"id": "mainContent"})
forecast = data.find('table',{'class':"table mrgn-bttm-md mrgn-tp-md textforecast hidden-xs"})
You could iterate over the rows of the table and print the text of each one. For example:
forecast = data.find('table',{'class':"table mrgn-bttm-md mrgn-tp-md textforecast hidden-xs"}).find_all("tr")
for tr in forecast[1:]:
    print " ".join(tr.text.split())
With this approach you get the contents of each row, excluding the first one, which is a header.
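The " ".join(text.split()) idiom used above is what turns the multi-line text of a table row into a single readable line. A small standalone sketch follows; the helper name normalize and the sample string are made up for illustration and are not taken from the live page.

```python
def normalize(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(text.split())

# Made-up row text, imitating the whitespace left over from nested table cells.
raw = "\n  Tonight \n   Partly cloudy.   Low 13.\n"
print(normalize(raw))
```

Calling split() with no argument splits on any whitespace and drops empty strings, so leading and trailing whitespace disappears as well.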
