This document lists the URLs and IP address ranges of Microsoft services:
https://learn.microsoft.com/en-us/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide#exchange-online
My goal is to write a Python script that checks the last-updated date of this document.
If the date changes (meaning some IPs changed), I need to know about it immediately. I couldn't find any API for this, so I wrote this script:
from bs4 import BeautifulSoup
import requests
import time
import re

url = "https://learn.microsoft.com/en-us/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide#exchange-online"
# set the headers as a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

while True:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    last_update_fra = soup.find(string=re.compile("01/04/2021"))
    time.sleep(60)
    soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
    if soup.find(string=re.compile("01/04/2021")) == last_update_fra:
        print(last_update_fra)
        continue
    else:
        # send an email for notification
        pass
I'm not sure this is the best way to do it, since if the date changes I also have to update my script with the new date.
In addition, is it OK to do this with BeautifulSoup, or is there another, better way?
BeautifulSoup is fine here. I don't even see an XHR request with the data there anyway.
A couple of things I noted:
Do you really want/need it checked every minute? Maybe it's better to check every day/24 hours, or every 12 or 6 hours?
If at any point it crashes, i.e. you lose your internet connection or get a 400 response, you'll need to restart the script and you'll lose whatever the last date was. So maybe consider a) writing the date to disk somewhere so it's not just stored in memory, and b) putting some try/excepts in there so that if it does encounter an error, it'll keep running and just try again at the next interval (or however you decide it should retry); a minimal sketch of both ideas follows the code below.
Code:
import requests
import time
from bs4 import BeautifulSoup

url = 'https://learn.microsoft.com/en-us/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide#exchange-online'
# set the headers as a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

last_update_fra = ''
while True:
    time.sleep(60)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    found_date = soup.find('time').text
    if found_date == last_update_fra:
        print(last_update_fra)
        continue
    else:
        # store new date
        last_update_fra = found_date
        # send an email for notification
        pass
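And here is a minimal sketch of the disk-persistence and retry ideas mentioned above; the state file name (date.txt), the check interval, and the send_email() helper are placeholders you would replace with your own:

import time
from pathlib import Path

import requests
from bs4 import BeautifulSoup

URL = 'https://learn.microsoft.com/en-us/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide#exchange-online'
HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
STATE_FILE = Path('date.txt')   # hypothetical file for the last seen date
CHECK_INTERVAL = 6 * 60 * 60    # every 6 hours instead of every minute

def send_email(new_date):
    pass  # placeholder: plug in your own notification here

while True:
    try:
        response = requests.get(URL, headers=HEADERS, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        found_date = soup.find('time').text.strip()

        # read the previously stored date (empty string on first run)
        last_date = STATE_FILE.read_text().strip() if STATE_FILE.exists() else ''

        if found_date != last_date:
            STATE_FILE.write_text(found_date)  # persist so a restart doesn't lose it
            send_email(found_date)
    except (requests.RequestException, AttributeError):
        # network error, bad status code, or missing <time> tag: skip this round and retry later
        pass
    time.sleep(CHECK_INTERVAL)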
Related
I have a script that uses bs4 to scrape a webpage and grab a string such as "Last Updated 4/3/2020, 8:28 p.m.". I then assign this string to a variable and send it in an email. The script is scheduled to run once a day. However, the date and time on the website change every other day, so instead of emailing every time I run the script, I'd like to set up a trigger so that it sends only when the date is different. How do I configure the script to detect that change?
String in HTML:
COVID-19 News Updates
Last Updated 4/3/2020, 12:08 p.m.
'''Checks municipal websites for changes in meal assistance during C19 pandemic'''
# Import requests (to download the page)
import requests
# Import BeautifulSoup (to parse what we download)
from bs4 import BeautifulSoup
import re
#list of urls
urls = ['http://www.vofil.com/covid19_updates']
#set the headers as a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
#download the homepage
response = requests.get(urls[0], headers=headers)
#parse the downloaded homepage and grab all text
soup = BeautifulSoup(response.text, "lxml")
#Find string
last_update_fra = soup.findAll(string=re.compile("Last Updated"))
print(last_update_fra)
#Put something here (if else..?) to trigger an email.
#I left off email block...
The closest answer I found was this, but it refers to tags, not a string.
Parsing changing tags BeautifulSoup
You would need to check every so often for a change using something along the lines of this:
'''Checks municipal websites for changes in meal assistance during C19 pandemic'''
# Import requests (to download the page)
import requests
# Import BeautifulSoup (to parse what we download)
from bs4 import BeautifulSoup
import re
import time
#list of urls
urls = ['http://www.vofil.com/covid19_updates']
#set the headers as a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
while True:
    # download the homepage
    response = requests.get(urls[0], headers=headers)
    # parse the downloaded homepage and grab all text
    soup = BeautifulSoup(response.text, "lxml")
    # Find string
    last_update_fra = soup.findAll(string=re.compile("Last Updated"))
    time.sleep(60)
    # download the page again
    soup = BeautifulSoup(requests.get(urls[0], headers=headers).text, "lxml")
    if soup.findAll(string=re.compile("Last Updated")) == last_update_fra:
        continue
    else:
        # Code to send email (a sketch follows below)
        pass
This waits a minute before checking for changes, but it can be adjusted as needed.
Credit: https://chrisalbon.com/python/web_scraping/monitor_a_website/
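The "# Code to send email" placeholder could be filled with something like this smtplib sketch; the SMTP server, credentials, and addresses are all hypothetical and would need to be replaced with your own:

import smtplib
from email.message import EmailMessage

def send_update_email(new_text):
    # hypothetical SMTP settings: replace with your own server and credentials
    msg = EmailMessage()
    msg['Subject'] = 'Page updated'
    msg['From'] = 'alerts@example.com'
    msg['To'] = 'me@example.com'
    msg.set_content(f'The page changed. New text: {new_text}')

    with smtplib.SMTP('smtp.example.com', 587) as server:
        server.starttls()
        server.login('alerts@example.com', 'app-password')
        server.send_message(msg)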
I'm just learning Python so forgive my poor coding.
I am just trying to create a website crawler which will eventually create a sitemap, report broken links, etc. But right at the beginning I am getting stuck: when creating a queue of links to crawl, I want links to be removed from the queue list as they get crawled. For some reason, though, there are duplicate URLs in the crawled list as well as in the queue list. I am guessing there is something wrong with the loop, but I am not sure.
Any help will be greatly appreciated.
In my main file I am simply calling two functions in this order:
find_page_links(url)
crawler(url)
And my linkfinder file looks like this:
from urllib import parse
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
import urllib.request
import re
from urllib.parse import urlparse
from general import file_to_set, set_to_file

queue_file = 'savvy/queue.txt'
crawled_file = 'savvy/crawled.txt'

def find_page_links(page_url):
    crawled = file_to_set(crawled_file)
    queued = file_to_set(queue_file)
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    req = urllib.request.Request(page_url, headers=headers)
    response = urllib.request.urlopen(req)
    res = BeautifulSoup(response.read(), "html.parser")
    for link in res.find_all('a'):
        aLink = urljoin(page_url, link.get('href'))
        if page_url in aLink:
            queued.add(aLink)
    crawled.add(page_url)
    queued.remove(page_url)
    set_to_file(crawled, crawled_file)
    set_to_file(queued, queue_file)
    return queued

def crawler(base_url):
    crawled = file_to_set(crawled_file)
    queued = file_to_set(queue_file)
    for link in queued.copy():
        if link not in crawled:
            find_page_links2(base_url, link, queued, crawled)
        else:
            queued.remove(link)

def find_page_links2(base_url, page_url, queued, crawled):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    req = urllib.request.Request(page_url, headers=headers)
    response = urllib.request.urlopen(req)
    res = BeautifulSoup(response.read(), "html.parser")
    for link in res.find_all('a'):
        aLink = urljoin(page_url, link.get('href'))
        if base_url in aLink:
            queued.add(aLink)
    crawled.add(page_url)
    queued.remove(page_url)
    set_to_file(crawled, crawled_file)
    set_to_file(queued, queue_file)
I think the reason you get duplicates is that you simply check the HTML for every <a> tag and place that link inside your queue. The problem here is that there might be several <a> tags that link to the same page.
An easy fix might be to simply get all links (just as you do now) and then remove duplicates before you start crawling.
Or you can check whether the link is already in the queue and only add new links, as in the sketch below.
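A sketch of that second option, reusing the loop from find_page_links2 in the question and assuming the same queued and crawled sets:

for link in res.find_all('a'):
    aLink = urljoin(page_url, link.get('href'))
    # only queue in-site links that are not already queued or crawled
    if base_url in aLink and aLink not in queued and aLink not in crawled:
        queued.add(aLink)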
I want to make a bot that tweets NBA scores every day, so I need to get the scores from the stats.nba website every day.
The problem is that if I don't click on the JSON link and access it with my browser before trying to open it in my code, it doesn't work. There is a new link every day for the matches of the night.
Does anyone know how to solve that?
Thank you
It would be interesting to see your code and figure out why it needs to be opened in the browser first, but if that really is the case:
Just open it with webbrowser first:
import webbrowser
webbrowser.open_new_tab(url)
# rest of your logic below.
This will open the url in your system's default browser.
You could also check whether you're missing some flags, such as allowing redirects, or whether you need a user-agent (so it looks like you're visiting from a browser):
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get(url, headers=headers, allow_redirects=True)
response.raise_for_status()
content = response.text
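Since the link returns JSON, you would typically parse it with response.json() rather than scraping the text. A rough sketch; the exact URL and any extra headers (some stats.nba endpoints are picky about things like Referer) are assumptions you would need to verify in your browser's network tab:

import requests

# hypothetical scoreboard URL: copy the real one from the browser's network tab
url = 'https://stats.nba.com/stats/scoreboardv2?DayOffset=0&GameDate=2020-01-01&LeagueID=00'
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
    'Referer': 'https://stats.nba.com/',  # assumption: some endpoints reject requests without it
}

response = requests.get(url, headers=headers, allow_redirects=True, timeout=10)
response.raise_for_status()
data = response.json()  # a plain dict you can pull the scores out of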
I want my embed message to look like this, but mine only returns one link.
Here's my code:
import requests
from bs4 import BeautifulSoup
from discord_webhook import DiscordWebhook, DiscordEmbed
url = 'https://www.solebox.com/Footwear/Basketball/Lebron-X-JE-Icon-QS-variant.html'
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'}
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")
for tag in soup.find_all('a', class_="selectSize"):
    # There's multiple 'id' resulting in more than one link
    aid = tag.get('id')
    # There's also multiple sizes
    size = tag.get('data-size-us')
    # These are the links that need to be shown in the embed message
    product_links = "https://www.solebox.com/{0}".format(aid)
webhook = DiscordWebhook(url='WebhookURL')
embed = DiscordEmbed(title='Title')
embed.set_author(name='Brand')
embed.set_thumbnail(url="Image")
embed.set_footer(text='Footer')
embed.set_timestamp()
embed.add_embed_field(name='Sizes', value='US{0}'.format(size))
embed.add_embed_field(name='Links', value='[Links]({0})'.format(product_links))
webhook.add_embed(embed)
webhook.execute()
This will most likely get you the results you want. product_links is a string, meaning that every iteration of your for loop just overwrites the product_links variable with a new string. If you declare a list before the loop and append to it, it will most likely give you what you wanted (see the follow-up sketch after the code for using that list in the embed).
Note: I had to use a different URL from that site. The one specified in the question was no longer available. I also had to use a different header, as the one the asker put up continuously fed me a 403 error.
Additional note: the URLs that are returned via your code logic lead nowhere. I feel you'll need to work that one through, since I don't know exactly what you're trying to do, but I think this answers the question of why you were only getting one link.
import requests
from bs4 import BeautifulSoup
url = 'https://www.solebox.com/Footwear/Basketball/Air-Force-1-07-PRM-variant-2.html'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.3"}
r = requests.get(url=url, headers=headers)
soup = BeautifulSoup(r.content, "lxml")
product_links = []  # Create our list of product links
for tag in soup.find_all('a', class_="selectSize"):
    # There's multiple 'id' resulting in more than one link
    aid = tag.get('id')
    # There's also multiple sizes
    size = tag.get('data-size-us')
    # These are the links that need to be shown in the embed message
    product_links.append("https://www.solebox.com/{0}".format(aid))
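To then show every size and link in the embed instead of just the last one, you could collect the sizes as well and join both lists into the embed fields. A sketch, assuming the same soup and DiscordEmbed setup as in the question:

sizes = []
product_links = []
for tag in soup.find_all('a', class_="selectSize"):
    sizes.append('US{0}'.format(tag.get('data-size-us')))
    product_links.append("https://www.solebox.com/{0}".format(tag.get('id')))

# one line per size/link pair in the embed fields
embed.add_embed_field(name='Sizes', value='\n'.join(sizes))
embed.add_embed_field(name='Links', value='\n'.join('[Link]({0})'.format(l) for l in product_links))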
I am scraping some public data from a website using Python 3.6.
I created a long list of page URLs I need to scrape (10k+).
I parse each one and produce a list with all the relevant information, then I append this to a comprehensive list.
I used to get some request timeout errors, so I tried to handle them using try/except.
The code runs without apparent errors but, when re-running it, I get very inconsistent results: the length of the final list changes substantially and I can prove that not all the pages have been parsed.
So my code stops at some point and I cannot tell at what point.
The timed_out variable is always zero at the end, no matter how long the produced list is.
Any help appreciated!
Best
Here is what I believe is the relevant part of the code:
import requests
from bs4 import BeautifulSoup
headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}
LIST_OF_URLS = ['URL','URL','URL']
FINAL_LIST = []
timed_out = 0
for URL in LIST_OF_URLS:
    try:
        result_page = BeautifulSoup(requests.get(URL, headers=headers, timeout=10).text, 'html.parser')
    except requests.exceptions.Timeout:
        timed_out += 1
    # The loop produces a LIST
    FINAL_LIST.append(LIST)
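One way to see what is actually happening is to catch the broader requests.exceptions.RequestException (which covers timeouts, connection errors, and bad status codes) and record which URLs fail. A sketch, assuming the rest of the loop builds LIST from the parsed page as in the original code:

import requests
from bs4 import BeautifulSoup

headers = {'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'}

LIST_OF_URLS = ['URL', 'URL', 'URL']
FINAL_LIST = []
failed_urls = []  # keep the URLs that could not be fetched, not just a counter

for URL in LIST_OF_URLS:
    try:
        response = requests.get(URL, headers=headers, timeout=10)
        response.raise_for_status()
    except requests.exceptions.RequestException:
        # covers timeouts, connection errors and bad status codes
        failed_urls.append(URL)
        continue  # skip parsing this page

    result_page = BeautifulSoup(response.text, 'html.parser')
    # ... build LIST from result_page as in the original code ...
    # FINAL_LIST.append(LIST)

print(len(FINAL_LIST), 'pages parsed,', len(failed_urls), 'failed')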