Extract date from multiple webpages with Python

I want to extract the date when a news article was published on a website. For some websites I have the exact HTML element where the date/time sits (div, p, time), but for others I do not:
These are links to some of the websites (German websites):
(3 Nov 2020) http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226
(Dec. 1, 2020) http://www.reutigen.ch/de/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id=1066837&ls=0&sq=&kategorie_id=&date_from=&date_to=
(10/22/2020) http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905
I have tried 3 different solutions with Python libs such as requests, htmldate and date_guesser, but I always get None, or, in the case of the htmldate lib, always the same date (2020.1.1).
from bs4 import BeautifulSoup
import requests
from htmldate import find_date
from date_guesser import guess_date, Accuracy
# Lib find_date
url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
response = requests.get(url)
my_date = find_date(response.content, extensive_search=True)
print(my_date, '\n')
# Lib guess_date
url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
my_date = guess_date(url=url, html=requests.get(url).text)
print(my_date.date, '\n')
# Lib requests: checking the response headers (I do not get a Last-Modified header)
my_date = requests.head('http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226')
print(my_date.headers, '\n')
Am I doing something wrong?
Can you please tell me whether there is a way to extract the date of publication from websites like these (where I do not have specific div, p, and datetime elements)?
IMPORTANT!
I want a universal date extraction, so that I can put these links in a for loop and run the same function on all of them.

I have never had much success with date-parsing libraries, so I usually go another route. I believe that the best method to extract the date strings from the sites in your question is regular expressions.
website: linden.ch
import requests
import re as regex
from bs4 import BeautifulSoup
from datetime import datetime
url = "http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
page_body = soup.find('body')
find_date = regex.search(r'(Datum der Neuigkeit)\s(\d{1,2}\W\s\w+\W\s\d{4})', str(page_body))
reformatted_timestamp = datetime.strptime(find_date.groups()[1], '%d. %b. %Y').strftime('%d-%m-%Y')
print(reformatted_timestamp)
# print output
03-11-2020
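One caveat on the strptime call above: %b matches English month abbreviations under the default C locale. That happens to work for "Nov.", but German-only abbreviations such as Mär, Mai, Okt, or Dez would raise a ValueError. A minimal workaround sketch; the month table is an assumption about how these sites abbreviate month names:

# Sketch: normalize German month abbreviations instead of relying on %b.
# The mapping below is an assumption about the abbreviations these sites use.
GERMAN_MONTHS = {
    'Jan': '01', 'Feb': '02', 'Mär': '03', 'Apr': '04', 'Mai': '05', 'Jun': '06',
    'Jul': '07', 'Aug': '08', 'Sep': '09', 'Okt': '10', 'Nov': '11', 'Dez': '12',
}

def parse_german_date(raw):
    # raw looks like '03. Nov. 2020'
    day, month, year = raw.replace('.', '').split()
    return f'{int(day):02d}-{GERMAN_MONTHS[month]}-{year}'

print(parse_german_date('03. Nov. 2020'))  # 03-11-2020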
website: buchholterberg.ch
import requests
import re as regex
from bs4 import BeautifulSoup
from datetime import datetime
url = "http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905"
response = requests.get(url)
soup = BeautifulSoup(response.content, 'html.parser')
page_body = soup.find('body')
find_date = regex.search(r'(Veröffentlicht)\s\w+:\s(\d{1,2}:\d{1,2}:\d{1,2})\s(\d{1,2}\.\d{1,2}\.\d{4})', str(page_body))
reformatted_timestamp = datetime.strptime(find_date.groups()[2], '%d.%m.%Y').strftime('%d-%m-%Y')
print(reformatted_timestamp)
# print output
22-10-2020
Update 12-04-2020
I looked at the source code for the two Python libraries you mentioned, htmldate and date_guesser. Neither of them can currently extract the date from the 3 sources listed in your question. The primary reason is the date formats and the language (German) of these target sites.
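As an aside, if you would still prefer a library-based route: the third-party dateparser library does understand German date strings, so one option is to extract the raw date text with a regex and hand the parsing itself off to it. A minimal sketch, assuming dateparser is installed (pip install dateparser):

# Sketch: dateparser understands German month names out of the box.
import dateparser

parsed = dateparser.parse('3. November 2020', languages=['de'])
print(parsed.strftime('%d-%m-%Y'))  # 03-11-2020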
I had some free time, so I put this together for you. The answer below can easily be modified to extract from any website and can be refined as needed based on the format of your target sources. It currently extracts the date from every link contained in urls.
all urls
import requests
import re as regex
from bs4 import BeautifulSoup


def extract_date(can_of_soup):
    page_body = can_of_soup.find('body')
    clean_body = ''.join(str(page_body).replace('\n', ''))
    if 'Datum der Neuigkeit' in clean_body or 'Veröffentlicht' in clean_body:
        # Match either 'Datum der Neuigkeit 3. Nov. 2020' style dates or
        # 'Veröffentlicht am: hh:mm:ss dd.mm.yyyy' style timestamps.
        date_formats = r'(Datum der Neuigkeit)\s(\d{1,2}\W\s\w+\W\s\d{4})|(Veröffentlicht am: \d{2}:\d{2}:\d{2} )(\d{1,2}\.\d{1,2}\.\d{4})'
        find_date = regex.search(date_formats, clean_body, regex.IGNORECASE)
        if find_date:
            clean_tuples = [i for i in list(find_date.groups()) if i]
            return ''.join(clean_tuples[1])
    else:
        # Fall back to known date-bearing <div> classes used by the other sites.
        tags = ['extra', 'elementStandard elementText',
                'icms-block icms-information-date icms-text-gemeinde-color']
        for tag in tags:
            date_tag = page_body.find('div', {'class': f'{tag}'})
            if date_tag is not None:
                children = date_tag.findChildren()
                if children:
                    find_date = regex.search(r'(\d{1,2}\.\d{1,2}\.\d{4})', str(children))
                    return ''.join(find_date.groups())
                else:
                    return ''.join(date_tag.contents)


def get_soup(target_url):
    response = requests.get(target_url)
    soup = BeautifulSoup(response.content, 'html.parser')
    return soup


urls = {'http://www.linden.ch/de/aktuelles/aktuellesinformationen/?action=showinfo&info_id=1074226',
        'http://www.reutigen.ch/de/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id=1066837&ls=0&sq=&kategorie_id=&date_from=&date_to=',
        'http://buchholterberg.ch/de/Gemeinde/Information/News/Newsmeldung?filterCategory=22&newsid=905',
        'https://www.steffisburg.ch/de/aktuelles/meldungen/Hochwasserschutz-und-Laengsvernetzung-Zulg.php',
        'https://www.wallisellen.ch/aktuellesinformationen/924227',
        'http://www.winkel.ch/de/aktuellesre/aktuelles/aktuellesinformationen/welcome.php?action=showinfo&info_id=1093910&ls=0&sq=&kategorie_id=&date_from=&date_to=',
        'https://www.aeschi.ch/de/aktuelles/mitteilungen/artikel/?tx_news_pi1%5Bnews%5D=87&tx_news_pi1%5Bcontroller%5D=News&tx_news_pi1%5Baction%5D=detail&cHash=ab4d329e2f1529d6e3343094b416baed'}

for url in urls:
    html = get_soup(url)
    article_date = extract_date(html)
    print(article_date)
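Since the point is to run the same function over many links unattended, it may also be worth hardening the loop so that one unreachable site or unmatched page does not stop the whole run. A sketch (the error handling choices here are mine, not part of the answer above):

# Sketch: keep the loop running even when a site fails or no date is found.
for url in urls:
    try:
        html = get_soup(url)               # network errors surface here
        article_date = extract_date(html)  # may return None or raise on odd markup
    except (requests.exceptions.RequestException, AttributeError) as error:
        print(f'{url} -> failed: {error}')
        continue
    print(article_date if article_date else f'{url} -> no date found')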

Related

Using multiple for loops with Python and Beautiful Soup

from bs4 import BeautifulSoup as bs
import requests
import pandas as pd
url = "https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/109825373"
data = requests.get(url)
soup = bs(data.content,"html.parser")
The code below is a test to get one item.
property_overview = soup.find(class_="p24_regularListing").find(class_="p24_propertyOverview").find(class_='p24_propertyOverviewRow').find(class_='col-xs-6 p24_propertyOverviewKey').text
property_overview
Output : 'Listing Number'
The code below is what we have to get all of the col-xs-6 p24_propertyOverviewKey items:
p24_regularListing_items = soup.find_all(class_="p24_regularListing")
for p24_propertyOverview_item in p24_regularListing_items:
    p24_propertyOverview_items = p24_propertyOverview_item.find_all(class_="p24_propertyOverview")
    for p24_propertyOverviewRow_item in p24_propertyOverview_items:
        p24_propertyOverviewRow_items = p24_propertyOverviewRow_item.find_all(class_="p24_propertyOverviewRow")
        for p24_propertyOverviewKey_item in p24_propertyOverviewRow_items:
            p24_propertyOverviewKey_items = p24_propertyOverviewKey_item.find_all(class_="col-xs-6 p24_propertyOverviewKey")
            p24_propertyOverviewKey_items
The code above only outputs one item, not all of them.
To put things more simply, you can use soup.select() (and via the comments, you can then use .get_text() to extract the text from each tag).
from bs4 import BeautifulSoup
import requests

resp = requests.get(
    "https://www.property24.com/for-sale/woodland-hills-wildlife-estate/bloemfontein/free-state/10467/109825373"
)
resp.raise_for_status()

soup = BeautifulSoup(resp.content, "html.parser")
texts = []
for tag in soup.select(
    # NB: this selector uses Python's implicit string concatenation
    # to split it onto several lines.
    ".p24_regularListing "
    ".p24_propertyOverview "
    ".p24_propertyOverviewRow "
    ".p24_propertyOverviewKey"
):
    texts.append(tag.get_text())

print(texts)
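The same thing as a list comprehension, if you prefer it on fewer lines (strip=True also trims surrounding whitespace from each string):

texts = [tag.get_text(strip=True) for tag in soup.select(
    ".p24_regularListing .p24_propertyOverview "
    ".p24_propertyOverviewRow .p24_propertyOverviewKey")]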

How to use Python to parse HTML that is in txt format?

I am trying to parse a txt file; an example is at the link below.
The txt file, however, is in the form of HTML. I am trying to get "COMPANY CONFORMED NAME", which is located at the top of the file, and my function should return "Monocle Acquisition Corp".
https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt
I have tried below:
import requests
from bs4 import BeautifulSoup
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html")
However, "soup" does not contain "COMPANY CONFORMED NAME" at all.
Can someone point me in the right direction?
The data you are looking for is not in an HTML structure, so Beautiful Soup is not the best tool. The correct and fast way of searching for this data is a simple regular expression like this:
import re
import requests
url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
text_string = r.content.decode()
name_re = re.compile(r"COMPANY CONFORMED NAME:\t*(.+)\n")
match = name_re.search(text_string).group(1)
print(match)
The part you are looking for is inside a huge <SEC-HEADER> tag.
You can get the whole section by using soup.find('sec-header'),
but you will need to parse the section manually. Something like this works, but it's a dirty job:
(view it on Replit: https://repl.it/@gui3/stackoverflow-parsing-html)
import requests
from bs4 import BeautifulSoup

url = 'https://www.sec.gov/Archives/edgar/data/1754170/0001571049-19-000004.txt'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html")

header = soup.find('sec-header').text

company_name = None
for line in header.split('\n'):
    split = line.split(':')
    if len(split) > 1:
        key = split[0]
        value = split[1]
        if key.strip() == 'COMPANY CONFORMED NAME':
            company_name = value.strip()
            break

print(company_name)
There may be a library that parses this data better than this code does.
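Building on the same idea: the header is a block of colon-separated key/value lines, so you can also collect every field at once instead of hunting for a single key. A minimal sketch that reuses the header variable from the code above:

# Sketch: parse every 'KEY: value' line of the header into a dict.
fields = {}
for line in header.split('\n'):
    key, sep, value = line.partition(':')
    if sep and value.strip():
        fields[key.strip()] = value.strip()

print(fields.get('COMPANY CONFORMED NAME'))  # Monocle Acquisition Corp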

How to scrape data from interactive chart using python?

I have the following link, which renders the exact graph I want to scrape: https://index.minfin.com.ua/ua/economy/index/svg.php?indType=1&fromYear=2010&acc=1
I simply can't tell whether it is an XML or SVG graph, or how to scrape the data from it. I think I need to use bs4 and requests, but I don't know how to go about it.
Could anyone help?
You will load the HTML like this:
import requests

url = "https://index.minfin.com.ua/ua/economy/index/svg.php?indType=1&fromYear=2010&acc=1"
resp = requests.get(url)
html = resp.text
Then you will create a BeautifulSoup object from this HTML.
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, features="html.parser")
After this, it is usually very subjective how to parse out what you want; candidate solutions can vary a lot. This is how I did it:
Using BeautifulSoup, I parsed all the "rect" elements and checked whether "onmouseover" exists on each rect.
rects = soup.svg.find_all("rect")
yx_points = []
for rect in rects:
    if rect.has_attr("onmouseover"):
        text = rect["onmouseover"]
        x_start_index = text.index("'") + 1
        y_finish_index = text[x_start_index:].index("'") + x_start_index
        yx = text[x_start_index:y_finish_index].split()
        print(text[x_start_index:y_finish_index])
        yx_points.append(yx)
I scraped the onmouseover= attribute and extracted the quoted parts, such as 02.2015 155,1.
Here is how yx_points looks now:
[['12.2009', '100,0'], ['01.2010', '101,8'], ['02.2010', '103,7'], ...]
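Those values are still strings, with month.year dates and a decimal comma, so a last step could convert them into real datetimes and floats. A sketch:

# Sketch: turn [['12.2009', '100,0'], ...] into (datetime, float) pairs.
from datetime import datetime

points = [(datetime.strptime(month, '%m.%Y'), float(value.replace(',', '.')))
          for month, value in yx_points]
print(points[0])  # (datetime.datetime(2009, 12, 1, 0, 0), 100.0)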
from bs4 import BeautifulSoup
import requests
import re

# First get all the text from the url.
url = "https://index.minfin.com.ua/ua/economy/index/svg.php?indType=1&fromYear=2010&acc=1"
response = requests.get(url)
html = response.text

# Find all the tags in which the data is stored.
soup = BeautifulSoup(html, 'lxml')
texts = soup.findAll("rect")

final = []
for each in texts:
    names = each.get('onmouseover')
    try:
        q = re.findall(r"'(.*?)'", names)
        final.append(q[0])
    except Exception as e:
        print(e)
# The details are appended to the final variable.

Removing UTF-8 encoding in Python

I tried scraping the webpage for Passengers & Cargo data. I couldn't convert it into normal data; the web encoding seems to be the challenge.
The Code I used is:
from __future__ import print_function
import requests
import pandas as pd
from bs4 import BeautifulSoup
import urllib

url = "https://www.faa.gov/data_research/passengers_cargo/unruly_passengers/"
r = requests.get(url)
soup = BeautifulSoup(r.content)

links = soup.find_all("tbody")
for link in links:
    print(link.text)
Output1
This prints in the format Year and Total. But when I append it to a list, the encoding ruins the data; you can see that in Output1.
names = []
for link in links:
    names.append(link.text)

names = map(lambda x: x.strip().encode('ascii'), names)
print(names)
Output2
The desired output should be Year and Total so that I can perform analyses.
You can use find_all with tr and td like this:
import requests
from bs4 import BeautifulSoup
import urllib

url = "https://www.faa.gov/data_research/passengers_cargo/unruly_passengers/"
r = requests.get(url)
soup = BeautifulSoup(r.content)

links = soup.find_all("tr")
data = []
for link in links:
    tds = link.find_all('td')
    if tds:
        data.append({'year': tds[0].text, 'total': tds[1].text})

print(data)
It works.
Hope it helps you.
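An alternative worth mentioning, since the question already imports pandas: pandas.read_html parses HTML tables straight into DataFrames, which skips the encoding juggling entirely. A sketch; this assumes the FAA page's table is plain HTML and that lxml or html5lib is installed:

# Sketch: let pandas find and parse the <table> elements on the page.
import pandas as pd

tables = pd.read_html("https://www.faa.gov/data_research/passengers_cargo/unruly_passengers/")
df = tables[0]  # assumes the first table on the page is the Year/Total table
print(df.head())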

Building a python web scraper, Need help to get correct output

I was building a web scraper using Python.
The purpose of my scraper is to fetch all the links to websites from this webpage: http://www.ebizmba.com/articles/torrent-websites
I want output like:
www.thepiratebay.se
www.kat.ph
I am new to Python and scraping, and I was doing this just for practice. Please help me get the right output.
My code:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebizmba.com/articles/torrent-websites")
soup = BeautifulSoup(r.content, "html.parser")
data = soup.find_all("div", {"class": "main-container-2"})
for item in data:
    print(item.contents[1].find_all("a"))
My output: http://i.stack.imgur.com/Xi37B.png
If you are web scraping for practice, have a look at regular expressions.
This would get just the headline links. The Needle1 string is the match pattern; the parentheses (http:.*?) contain the match group.
import urllib2
import re

myURL = "http://www.ebizmba.com/articles/torrent-websites"
req = urllib2.Request(myURL)
Needle1 = '<p><a href="(http:.*?)" rel="nofollow" target="_blank">'
for match in re.finditer(Needle1, urllib2.urlopen(req).read()):
    print(match.group(1))
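Note that urllib2 is Python 2 only. Under Python 3 the same approach would look roughly like this (urllib.request replaces urllib2, and the response bytes must be decoded before regex matching):

# Sketch: Python 3 equivalent of the urllib2 snippet above.
import re
import urllib.request

myURL = "http://www.ebizmba.com/articles/torrent-websites"
html = urllib.request.urlopen(myURL).read().decode('utf-8', errors='replace')
Needle1 = '<p><a href="(http:.*?)" rel="nofollow" target="_blank">'
for match in re.finditer(Needle1, html):
    print(match.group(1))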
Use .get('href') like this:
import requests
from bs4 import BeautifulSoup

r = requests.get("http://www.ebizmba.com/articles/torrent-websites")
soup = BeautifulSoup(r.text, "html.parser")
data = soup.find_all("div", {"class": "main-container-2"})
for i in data:
    for j in i.contents[1].find_all("a"):
        print(j.get('href'))
Full output:
http://www.thepiratebay.se
http://siteanalytics.compete.com/thepiratebay.se
http://quantcast.com/thepiratebay.se
http://www.alexa.com/siteinfo/thepiratebay.se/
http://www.kickass.to
http://siteanalytics.compete.com/kickass.to
http://quantcast.com/kickass.to
http://www.alexa.com/siteinfo/kickass.to/
http://www.torrentz.eu
http://siteanalytics.compete.com/torrentz.eu
http://quantcast.com/torrentz.eu
http://www.alexa.com/siteinfo/torrentz.eu/
http://www.extratorrent.cc
http://siteanalytics.compete.com/extratorrent.cc
http://quantcast.com/extratorrent.cc
http://www.alexa.com/siteinfo/extratorrent.cc/
http://www.yify-torrents.com
http://siteanalytics.compete.com/yify-torrents.com
http://quantcast.com/yify-torrents.com
http://www.alexa.com/siteinfo/yify-torrents.com
http://www.bitsnoop.com
http://siteanalytics.compete.com/bitsnoop.com
http://quantcast.com/bitsnoop.com
http://www.alexa.com/siteinfo/bitsnoop.com/
http://www.isohunt.to
http://siteanalytics.compete.com/isohunt.to
http://quantcast.com/isohunt.to
http://www.alexa.com/siteinfo/isohunt.to/
http://www.sumotorrent.sx
http://siteanalytics.compete.com/sumotorrent.sx
http://quantcast.com/sumotorrent.sx
http://www.alexa.com/siteinfo/sumotorrent.sx/
http://www.torrentdownloads.me
http://siteanalytics.compete.com/torrentdownloads.me
http://quantcast.com/torrentdownloads.me
http://www.alexa.com/siteinfo/torrentdownloads.me/
http://www.eztv.it
http://siteanalytics.compete.com/eztv.it
http://quantcast.com/eztv.it
http://www.alexa.com/siteinfo/eztv.it/
http://www.rarbg.com
http://siteanalytics.compete.com/rarbg.com
http://quantcast.com/rarbg.com
http://www.alexa.com/siteinfo/rarbg.com/
http://www.1337x.org
http://siteanalytics.compete.com/1337x.org
http://quantcast.com/1337x.org
http://www.alexa.com/siteinfo/1337x.org/
http://www.torrenthound.com
http://siteanalytics.compete.com/torrenthound.com
http://quantcast.com/torrenthound.com
http://www.alexa.com/siteinfo/torrenthound.com/
https://demonoid.org/
http://siteanalytics.compete.com/demonoid.pw
http://quantcast.com/demonoid.pw
http://www.alexa.com/siteinfo/demonoid.pw/
http://www.fenopy.se
http://siteanalytics.compete.com/fenopy.se
http://quantcast.com/fenopy.se
http://www.alexa.com/siteinfo/fenopy.se/
