BeautifulSoup Detect Change Trigger - python

I have a script that uses bs4 to scrape a webpage and grab a string named, "Last Updated 4/3/2020, 8:28 p.m.". I then assign this string to a variable and send it in an email. The script is scheduled to run once a day. However, the date and time on the website change every other day. So, instead of emailing every time I run the script I'd like to set up a trigger so that it sends only when the date is different. How do I configure the script to detect that change?
String in HTML:
COVID-19 News Updates
Last Updated 4/3/2020, 12:08 p.m.
'''Checks municipal websites for changes in meal assistance during C19 pandemic'''
# Import requests (to download the page)
import requests
# Import BeautifulSoup (to parse what we download)
from bs4 import BeautifulSoup
import re
#list of urls
urls = ['http://www.vofil.com/covid19_updates']
#set the headers as a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
#download the homepage
response = requests.get(urls[0], headers=headers)
#parse the downloaded homepage and grab all text
soup = BeautifulSoup(response.text, "lxml")
#Find string
last_update_fra = soup.findAll(string=re.compile("Last Updated"))
print(last_update_fra)
#Put something here (if else..?) to trigger an email.
#I left off email block...
The closet answer I found was this, but it's referring to the tags, not a string.
Parsing changing tags BeautifulSoup

You would need to check every so often for a change using something along the lines of this:
'''Checks municipal websites for changes in meal assistance during C19 pandemic'''
# Import requests (to download the page)
import requests
# Import BeautifulSoup (to parse what we download)
from bs4 import BeautifulSoup
import re
import time
#list of urls
urls = ['http://www.vofil.com/covid19_updates']
#set the headers as a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
while True:
#download the homepage
response = requests.get(urls[0], headers=headers)
#parse the downloaded homepage and grab all text
soup = BeautifulSoup(response.text, "lxml")
#Find string
last_update_fra = soup.findAll(string=re.compile("Last Updated"))
time.sleep(60)
#download request again
soup = BeautifulSoup(requests.get(urls[0], headers=headers), "lxml")
if soup.findAll(string=re.compile("Last Updated")) == last_update_fra:
continue
else:
# Code to send email
*This waits a minute before checking for changes, but it can be adjusted as needed.
[Credit][1]: https://chrisalbon.com/python/web_scraping/monitor_a_website/

Related

Parsing text with bs4 works with selenium but does not work with requests in Python

This code works and returns the single digit number that i want but its so slow and takes good 10 seconds to complete.I will be running this 4 times for my use so thats 40 seconds wasted every run.
` from selenium import webdriver
from bs4 import BeautifulSoup
options = webdriver.FirefoxOptions()
options.add_argument('--headless')
driver = webdriver.Firefox(options=options)
driver.get('https://warframe.market/items/ivara_prime_blueprint')
html = driver.page_source
soup = BeautifulSoup(html, 'html.parser')
price_element = soup.find('div', {'class': 'row order-row--Alcph'})
price2=price_element.find('div',{'class':'order-row__price--hn3HU'})
price = price2.text
print(int(price))
driver.close()`
This code on the other hand does not work. It returns None.
` import requests
from bs4 import BeautifulSoup
url='https://warframe.market/items/ivara_prime_blueprint'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, 'html.parser')
price_element=soup.find('div', {'class': 'row order-row--Alcph'})
price2=price_element.find('div',{'class':'order-row__price--hn3HU'})
price = price2.text
print(int(price))`
First thought was to add user agent but still did not work. When I print(soup) it gives me html code but when i parse it further it stops and starts giving me None even tho its the same command like in selenium example.
The data is loaded dynamically within a <script> tag so Beautifulsoup doesn't see it (it doesn't render Javascript).
As an example, to get the data, you can use:
import json
import requests
from bs4 import BeautifulSoup
url = "https://warframe.market/items/ivara_prime_blueprint"
headers = {
"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36"
}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")
script_tag = soup.select_one("#application-state")
json_data = json.loads(script_tag.string)
# Uncomment the line below to see all the data
# from pprint import pprint
# pprint(json_data)
for data in json_data["payload"]["orders"]:
print(data["user"]["ingame_name"])
Prints:
Rogue_Monarch
Rappei
KentKoes
Tenno61189
spinifer14
Andyfr0nt
hollowberzinho
You can access the data as a dict and acess the keys/values.
I'd recommend an online tool to view all the JSON since it's quite large.
See also
Parsing out specific values from JSON object in BeautifulSoup

webscraping python not showing all tags

I'm new to webscraping. I was trying to make a script that gets data from a balance sheet (here the site: https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm). The problem is getting the data: when I watch at the source code in my browser, I'm able to find the tag and the correct value. Once I write down a script with bs4, I don't get anything.
I'm trying to get informations form the balance sheet: Products, Services, Cost of sales... and the data contained in the table 1. (I'm sorry, but I can't post the image. Anyway is the first table you see scrolling down).
Here's my code.
from bs4 import BeautifulSoup
from urllib.request import Request, urlopen
url = "https://www.sec.gov/ix?doc=/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm"
req = Request(url, headers={'User-Agent': 'Mozilla/5.0'})
read_data = urlopen(req).read()
soup_data = BeautifulSoup(read_data,"lxml")
names = soup_data.find_all("td")
for name in names:
print(name)
Thanks for your time.
Try this URL:
Also include the headers to get the data.
import requests
from bs4 import BeautifulSoup
url = "https://www.sec.gov/Archives/edgar/data/320193/000032019320000010/a10-qq1202012282019.htm"
headers = {"User-agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36"}
req = requests.get(url, headers=headers)
soup_data = BeautifulSoup(req.text,"lxml")
You will be able to find the data you need.

API instead of BeautifulSoup?

This document defines some URL's and IP's of MS services:
https://learn.microsoft.com/en-us/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide#exchange-online
My goal is to write a Python script that check what is the last updated date of this document.
If the date is change (means that some IP's changed), I need to know it immediately. I can't found any API for this goal, so I wrote this script:
from bs4 import BeautifulSoup
import requests
import time
import re
url = "https://learn.microsoft.com/en-us/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide#exchange-online"
#set the headers as a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
while True:
response = requests.get(url,headers=headers)
soup = BeautifulSoup(response.text,"html.parser")
last_update_fra = soup.find(string=re.compile("01/04/2021"))
time.sleep(60)
soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
if soup.find(string=re.compile("01/04/2021")) == last_update_fra:
print(last_update_fra)
continue
else:
#send an email for notification
pass
I'm not sure if this is the best way to do it. since if the date will change, I also need to update my script to another date (the updated date).
In addition, this is ok to do it with BeautifulSoup? or there's another and a better way?
Beautifulsoup is fine here. I don't even see an XHR request with the data anayway there.
Couple things I noted:
Do you really want/need it checked every minute? Maybe better every day/24 hours, or ever 12 or 6 hours?
If at any point it crashes, Ie you lose internet connection or get a 400 response, you'll need to restart the script and lose whatever the last date was. So maybe consider a) writing the date to disk somewhere so it's not just stored in memory, and b) put in some try/exceptions in there so that if it does encounter an error, it'll keep running and just try again at the next interval (or how ever you decide it to try again).
Code:
import requests
import time
from bs4 import BeautifulSoup
url = 'https://learn.microsoft.com/en-us/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide#exchange-online'
#set the headers as a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
last_update_fra = ''
while True:
time.sleep(60)
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
found_date = soup.find('time').text
if found_date == last_update_fra:
print(last_update_fra)
continue
else:
# store new date
last_update_fra = found_date
#send an email for notification
pass

Requests in python return error, while opening link manually works perfect

import requests
a = 'http://tmsearch.uspto.gov/bin/showfield?f=toc&state=4809%3Ak1aweo.1.1&p_search=searchstr&BackReference=&p_L=100&p_plural=no&p_s_PARA1={}&p_tagrepl%7E%3A=PARA1%24MI&expr=PARA1+or+PARA2&p_s_PARA2=&p_tagrepl%7E%3A=PARA2%24ALL&a_default=search&f=toc&state=4809%3Ak1aweo.1.1&a_search=Submit+Query'
a = a.format('coca-cola')
b = requests.get(a)
print(b.text)
print(b.url)
If you copy the printed url and paste it in browser, site will open with no problem, but if you do requests.get, i get some token? errors. Is there anything I can do?
VIA requests.get I url back, but no data if doing manually. It says: <html><head><TITLE>TESS -- Error</TITLE></head><body>
First of all, make sure you follow the website's Terms of Use and usage policies.
This is a little bit more complicated that it may seem. You need to maintain a certain state throughout the [web-scraping session][1]. And, you'll need an HTML parser, like BeautifulSoup along the way:
from urllib.parse import parse_qs, urljoin
import requests
from bs4 import BeautifulSoup
SEARCH_TERM = 'coca-cola'
with requests.Session() as session:
session.headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.133 Safari/537.36'}
# get the current search state
response = session.get("https://tmsearch.uspto.gov/")
soup = BeautifulSoup(response.content, "html.parser")
link = soup.find("a", text="Basic Word Mark Search (New User)")["href"]
session.get(urljoin(response.url, link))
state = parse_qs(link)['state'][0]
# perform a search
response = session.post("https://tmsearch.uspto.gov/bin/showfield", data={
'f': 'toc',
'state': state,
'p_search': 'search',
'p_s_All': '',
'p_s_ALL': SEARCH_TERM + '[COMB]',
'a_default': 'search',
'a_search': 'Submit'
})
# print search results
soup = BeautifulSoup(response.content, "html.parser")
print(soup.find("font", color="blue").get_text())
table = soup.find("th", text="Serial Number").find_parent("table")
for row in table('tr')[1:]:
print(row('td')[1].get_text())
It prints all the serial number values from the first search results page, for demonstration purposes.

Content missing in html file scraping with BeautifulSoup

i am trying to scrape the content of this site: http://www.whoscored.com/Matches/824609/Live . If i view the html-file in Chrome under the network tab am i able to see everything i want to scrape. But if i run the script below is a lot of data is missing in the results, data in JSON format. Its data of every specific event during the game. Why is the result different when i inspect the content in chrome and when i scrape the site?
import requests
from bs4 import BeautifulSoup
url = 'http://www.whoscored.com/Matches/824609/Live'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/36.0.1985.125 Safari/537.36'}
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.text)
print soup

Categories