I'm just learning Python so forgive my poor coding.
I am just trying to create a website crawler which will eventually create a sitemap, report broken links, etc. Right at the beginning, though, I am getting stuck: when building a queue of links to crawl, I want each link to be removed from the queue list as it gets crawled. But for some reason, duplicate URLs end up in the crawled list as well as in the queue list. I am guessing there is something wrong with the loop, but I am not sure.
Any help will be greatly appreciated.
In my main file I simply call two functions in this order:
find_page_links(url)
crawler(url)
My linkfinder file looks like this:
from urllib import parse
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
import urllib.request
import re
from urllib.parse import urlparse
from general import file_to_set, set_to_file
queue_file = 'savvy/queue.txt'
crawled_file = 'savvy/crawled.txt'
def find_page_links(page_url):
    crawled = file_to_set(crawled_file)
    queued = file_to_set(queue_file)
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    req = urllib.request.Request(page_url, headers=headers)
    response = urllib.request.urlopen(req)
    res = BeautifulSoup(response.read(), "html.parser")
    for link in res.find_all('a'):
        aLink = urljoin(page_url, link.get('href'))
        if page_url in aLink:
            queued.add(aLink)
    crawled.add(page_url)
    queued.remove(page_url)
    set_to_file(crawled, crawled_file)
    set_to_file(queued, queue_file)
    return queued
def crawler(base_url):
    crawled = file_to_set(crawled_file)
    queued = file_to_set(queue_file)
    for link in queued.copy():
        if link not in crawled:
            find_page_links2(base_url, link, queued, crawled)
        else:
            queued.remove(link)
def find_page_links2(base_url, page_url, queued, crawled):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    req = urllib.request.Request(page_url, headers=headers)
    response = urllib.request.urlopen(req)
    res = BeautifulSoup(response.read(), "html.parser")
    for link in res.find_all('a'):
        aLink = urljoin(page_url, link.get('href'))
        if base_url in aLink:
            queued.add(aLink)
    crawled.add(page_url)
    queued.remove(page_url)
    set_to_file(crawled, crawled_file)
    set_to_file(queued, queue_file)
I think the reason you get duplicates is that you simply check the HTML for <a> tags and place each link inside your queue. The problem here is that there might be several <a>'s that link to the same page.
An easy fix might be to simply get all links (just as you do now) and then remove duplicates before you start crawling.
Alternatively, you can check whether the link is already in the queue (or already crawled) and only add new links, as in the sketch below.
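For example, a minimal sketch of that check inside the link loop of find_page_links2 (reusing the names from your code, so queued and crawled are the sets loaded from the text files):

for link in res.find_all('a'):
    aLink = urljoin(page_url, link.get('href'))
    # skip anything already queued or already crawled, so the same page
    # is never added twice even if many <a> tags point to it
    if base_url in aLink and aLink not in queued and aLink not in crawled:
        queued.add(aLink)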
Related
I've been working on a web scraper for top news sites. Beautiful Soup in Python has been a great tool, letting me get full articles with very simple code:
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from bs4 import BeautifulSoup

article_url = 'https://apnews.com/article/lifestyle-travel-coronavirus-pandemic-health-education-418fe38201db53d2848e0138a28ff824'

session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

user_agent = 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
request_header = {'User-Agent': user_agent}

source = session.get(article_url, headers=request_header).text
soup = BeautifulSoup(source, 'lxml')

#get all <p> paragraphs from article
paragraphs = soup.find_all('p')

#print each paragraph as a line
for paragraph in paragraphs:
    print(paragraph)
This works great on most news sites I've tried, BUT for some reason the AP site gives me no output at all. That is strange, because the exact same code works on maybe 10 other sites like the NYT, WaPo, and The Hill, and I can't figure out why.
Where every other site prints out all the paragraphs, this one prints nothing. Except when I look at the paragraphs/soup variables, here is the kind of thing I see:
address the pandemic.\u003c/p>\u003cdiv class=\"ad-placeholder\">\u003c/div>\u003cp>Instead, public schools
Clearly what's happening is that the < HTML symbol is being translated as \u003c. And because of that, find_all('p') can't properly find the HTML tags. But for some reason only the AP site is doing it. When I inspect the AP website, their HTML has the same symbols as all the other sites.
Does anyone have any idea why this is happening? Or what I can do to fix it? Because I'm seriously confused
For me, at least, I had to extract a JavaScript object containing the data with a regex, parse it with json into a Python dict, grab the value holding the page HTML as you see it in the browser, soup that, and then extract the paragraphs. I removed the retries stuff; you can easily re-insert it (a sketch of how is shown after the code).
import requests
#from requests.adapters import HTTPAdapter
#from requests.packages.urllib3.util.retry import Retry
from bs4 import BeautifulSoup
import re, json

article_url = 'https://apnews.com/article/lifestyle-travel-coronavirus-pandemic-health-education-418fe38201db53d2848e0138a28ff824'

user_agent = 'Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
request_header = {'User-Agent': user_agent}

source = requests.get(article_url, headers=request_header).text

# the article body lives in a JS state object embedded in the page
data = json.loads(re.search(r"window\['titanium-state'\] = (.*)", source, re.M).group(1))
content = data['content']['data']
content = content[list(content.keys())[0]]

soup = BeautifulSoup(content['storyHTML'], 'lxml')
for p in soup.select('p'):
    print(p.text.strip())
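If you want the retry behaviour back, here is a minimal sketch of re-inserting it, reusing the Session/HTTPAdapter/Retry setup from the question and the article_url and request_header defined above:

from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

# retry failed connections a few times with a small backoff
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)

# then fetch through the session instead of requests.get
source = session.get(article_url, headers=request_header).text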
This document defines some URLs and IPs of MS services:
https://learn.microsoft.com/en-us/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide#exchange-online
My goal is to write a Python script that checks the last updated date of this document.
If the date changes (meaning that some IPs changed), I need to know immediately. I couldn't find any API for this, so I wrote this script:
from bs4 import BeautifulSoup
import requests
import time
import re

url = "https://learn.microsoft.com/en-us/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide#exchange-online"

#set the headers as a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

while True:
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, "html.parser")
    last_update_fra = soup.find(string=re.compile("01/04/2021"))

    time.sleep(60)

    soup = BeautifulSoup(requests.get(url, headers=headers).text, "html.parser")
    if soup.find(string=re.compile("01/04/2021")) == last_update_fra:
        print(last_update_fra)
        continue
    else:
        #send an email for notification
        pass
I'm not sure this is the best way to do it, since if the date changes I would also have to update the hard-coded date in my script.
In addition, is it OK to do this with BeautifulSoup, or is there another, better way?
BeautifulSoup is fine here; I don't even see an XHR request with the data there anyway.
A couple of things I noted:
Do you really want/need it checked every minute? Maybe better every day/24 hours, or every 12 or 6 hours?
If at any point it crashes, i.e. you lose your internet connection or get a 400 response, you'll need to restart the script and you lose whatever the last date was. So maybe consider a) writing the date to disk somewhere so it's not just stored in memory, and b) putting in some try/excepts so that if it does encounter an error, it'll keep running and just try again at the next interval (or however you decide it should retry). A minimal sketch of both ideas follows the code below.
Code:
import requests
import time
from bs4 import BeautifulSoup

url = 'https://learn.microsoft.com/en-us/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide#exchange-online'

#set the headers as a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

last_update_fra = ''
while True:
    time.sleep(60)
    response = requests.get(url, headers=headers)
    soup = BeautifulSoup(response.text, 'html.parser')
    found_date = soup.find('time').text

    if found_date == last_update_fra:
        print(last_update_fra)
        continue
    else:
        # store new date
        last_update_fra = found_date
        #send an email for notification
        pass
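And here is that sketch of the two suggestions above; the state file name last_update.txt and the 24-hour interval are just assumptions for illustration:

import time
import requests
from bs4 import BeautifulSoup

url = 'https://learn.microsoft.com/en-us/microsoft-365/enterprise/urls-and-ip-address-ranges?view=o365-worldwide#exchange-online'
headers = {'User-Agent': 'Mozilla/5.0'}   # same idea as the browser header above
state_file = 'last_update.txt'            # hypothetical path for persisting the date

def read_saved_date():
    # return the last date we saw, or '' if the file doesn't exist yet
    try:
        with open(state_file) as f:
            return f.read().strip()
    except FileNotFoundError:
        return ''

while True:
    try:
        response = requests.get(url, headers=headers, timeout=30)
        response.raise_for_status()
        soup = BeautifulSoup(response.text, 'html.parser')
        found_date = soup.find('time').text
        if found_date != read_saved_date():
            # persist the new date so a restart doesn't lose it
            with open(state_file, 'w') as f:
                f.write(found_date)
            # send an email for notification here
    except (requests.RequestException, AttributeError):
        # network error or the <time> tag wasn't found; try again next interval
        pass
    time.sleep(24 * 60 * 60)   # check once a day instead of every minute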
I have a script that uses bs4 to scrape a webpage and grab a string, "Last Updated 4/3/2020, 8:28 p.m.". I then assign this string to a variable and send it in an email. The script is scheduled to run once a day. However, the date and time on the website change every other day. So, instead of emailing every time I run the script, I'd like to set up a trigger so that it sends only when the date is different. How do I configure the script to detect that change?
String in HTML:
COVID-19 News Updates
Last Updated 4/3/2020, 12:08 p.m.
'''Checks municipal websites for changes in meal assistance during C19 pandemic'''
# Import requests (to download the page)
import requests
# Import BeautifulSoup (to parse what we download)
from bs4 import BeautifulSoup
import re
#list of urls
urls = ['http://www.vofil.com/covid19_updates']
#set the headers as a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
#download the homepage
response = requests.get(urls[0], headers=headers)
#parse the downloaded homepage and grab all text
soup = BeautifulSoup(response.text, "lxml")
#Find string
last_update_fra = soup.findAll(string=re.compile("Last Updated"))
print(last_update_fra)
#Put something here (if else..?) to trigger an email.
#I left off email block...
The closest answer I found was this, but it refers to the tags, not a string:
Parsing changing tags BeautifulSoup
You would need to check every so often for a change using something along the lines of this:
'''Checks municipal websites for changes in meal assistance during C19 pandemic'''
# Import requests (to download the page)
import requests
# Import BeautifulSoup (to parse what we download)
from bs4 import BeautifulSoup
import re
import time

#list of urls
urls = ['http://www.vofil.com/covid19_updates']

#set the headers as a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

while True:
    #download the homepage
    response = requests.get(urls[0], headers=headers)
    #parse the downloaded homepage and grab all text
    soup = BeautifulSoup(response.text, "lxml")
    #Find string
    last_update_fra = soup.findAll(string=re.compile("Last Updated"))

    time.sleep(60)

    #download and parse the page again
    soup = BeautifulSoup(requests.get(urls[0], headers=headers).text, "lxml")
    if soup.findAll(string=re.compile("Last Updated")) == last_update_fra:
        continue
    else:
        # Code to send email goes here
        pass
This waits a minute before checking for changes, but it can be adjusted as needed.
Credit: https://chrisalbon.com/python/web_scraping/monitor_a_website/
I've written a page monitor to receive the latest product link from Nike.com, but I only want it to return a link if it's from a product that has just been uploaded to the site. I haven't been able to find any help similar to this. This is the page monitor, written in Python. Any help with returning only new links would be appreciated.
import requests
from bs4 import BeautifulSoup
import time
import json
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/53.0.2785.143 Safari/537.36'
}

def item_finder():
    source = requests.get('https://www.nike.com/launch/', headers=headers).text
    soup = BeautifulSoup(source, 'lxml')
    card = soup.find('figure', class_='item ncss-col-sm-12 ncss-col-md-6 ncss-col-lg-4 va-sm-t pb2-sm pb4-md prl0-sm prl2-md ')
    card_data = "https://nike.com" + card.a.get('href')
    print(card_data)

item_finder()
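One possible approach, just a sketch reusing the headers and imports from the snippet above (the 60-second poll interval is an arbitrary choice): keep a set of links you've already seen and only print a link the first time it appears.

seen_links = set()

def item_finder():
    source = requests.get('https://www.nike.com/launch/', headers=headers).text
    soup = BeautifulSoup(source, 'lxml')
    card = soup.find('figure', class_='item ncss-col-sm-12 ncss-col-md-6 ncss-col-lg-4 va-sm-t pb2-sm pb4-md prl0-sm prl2-md ')
    card_data = "https://nike.com" + card.a.get('href')
    # only report links that have not been seen before
    if card_data not in seen_links:
        seen_links.add(card_data)
        print(card_data)

while True:
    item_finder()
    time.sleep(60)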
Sorry to repost this question; someone migrated it to a different site, and without the cookies I could not comment or edit.
I'm new to Python and bs4, so please go easy on me.
#!/usr/bin/python3
import bs4 as bs
import urllib.request
import time, datetime, os, requests, lxml.html
import re
from fake_useragent import UserAgent
url = "https://www.cvedetails.com/vulnerability-list.php"
ua = UserAgent()
header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
snkr = requests.get(url,headers=header)
soup = bs.BeautifulSoup(snkr.content,'lxml')
for item in soup.find_all('tr', class_="srrowns"):
    print(item.td.next_sibling.next_sibling.a)
prints:
CVE-2017-6712
CVE-2017-6708
CVE-2017-6707
CVE-2017-1269
CVE-2017-0711
CVE-2017-0706
Using the recommended string:
print(item.td.next_sibling.next_sibling.a.href)
prints:
None
None
None
None
None
None
I can't figure out how to extract the /cve/CVE-2017-XXXX/ parts. Perhaps I've gone about it the wrong way. I don't need the titles or HTML, just the URIs.
I think you should try something like:
for item in soup.find_all('tr', class_="srrowns"):
    print(item.td.next_sibling.next_sibling.a['href'])
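The reason a.href printed None is that attribute access on a BeautifulSoup tag is treated as a search for a child tag of that name (here, a nonexistent <href> tag); the tag's HTML attributes are read with dictionary-style indexing, a['href'], or with a.get('href').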