Extracting href using bs4/python3? (again) - python

Sorry to repost this question; someone migrated it to a different site, and without the cookies I could not comment or edit it.
I'm new to Python and bs4, so please go easy on me.
#!/usr/bin/python3
import bs4 as bs
import urllib.request
import time, datetime, os, requests, lxml.html
import re
from fake_useragent import UserAgent
url = "https://www.cvedetails.com/vulnerability-list.php"
ua = UserAgent()
header = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/59.0.3071.115 Safari/537.36'}
snkr = requests.get(url,headers=header)
soup = bs.BeautifulSoup(snkr.content,'lxml')
for item in soup.find_all('tr', class_="srrowns"):
    print(item.td.next_sibling.next_sibling.a)
prints:
CVE-2017-6712
CVE-2017-6708
CVE-2017-6707
CVE-2017-1269
CVE-2017-0711
CVE-2017-0706
Using the recommended expression:
print(item.td.next_sibling.next_sibling.a.href)
prints:
None
None
None
None
None
None
I can't figure out how to extract the /cve/CVE-2017-XXXX/ parts. Perhaps I've gone about it the wrong way. I don't need the titles or HTML, just the URIs.

I think you should try something like:
for item in soup.find_all('tr', class_="srrowns"):
    print(item.td.next_sibling.next_sibling.a['href'])
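For what it's worth, BeautifulSoup exposes a tag's HTML attributes like dictionary keys, so tag['href'] (or tag.get('href')) reads the attribute, while tag.href looks for a child tag named href and therefore returns None. A small sketch of pulling just the URIs, assuming the same row structure as in the question:
for item in soup.find_all('tr', class_="srrowns"):
    link = item.td.next_sibling.next_sibling.a
    if link is not None:
        # .get() avoids a KeyError if an anchor lacks an href
        print(link.get('href'))   # e.g. /cve/CVE-2017-6712/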

Related

Beautiful Soup \u003c appears and messes up find_all?

I've been working on a web scraper for top news sites. Beautiful Soup in Python has been a great tool, letting me get full articles with very simple code:
import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry
from bs4 import BeautifulSoup

article_url='https://apnews.com/article/lifestyle-travel-coronavirus-pandemic-health-education-418fe38201db53d2848e0138a28ff824'
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)
user_agent='Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
request_header={ 'User-Agent': user_agent}
source=session.get(article_url, headers=request_header).text
soup = BeautifulSoup(source,'lxml')
#get all <p> paragraphs from article
paragraphs=soup.find_all('p')
#print each paragraph as a line
for paragraph in paragraphs:
    print(paragraph)
This works great on most news sites I've tried, BUT for some reason the AP site gives me no output at all, which is strange because the exact same code works on maybe 10 other sites like the NYT, WaPo, and The Hill. And I don't know why.
Where every other site prints out all the paragraphs, this one prints nothing. But when I look at the soup variable, here is the kind of thing I see:
address the pandemic.\u003c/p>\u003cdiv class=\"ad-placeholder\">\u003c/div>\u003cp>Instead, public schools
Clearly what's happening is that the < HTML symbol is being escaped as \u003c, and because of that find_all('p') can't properly find the HTML tags. But for some reason only the AP site is doing it. When I inspect the AP website, their HTML has the same symbols as all the other sites.
Does anyone have any idea why this is happening? Or what I can do to fix it? Because I'm seriously confused
For me, at least, the fix was to extract the JavaScript object containing the data with a regex, parse it with json, grab the value holding the page HTML as you see it in the browser, soup that, and then extract the paragraphs. I removed the retries setup; you can easily re-insert it.
import requests
#from requests.adapters import HTTPAdapter
#from requests.packages.urllib3.util.retry import Retry
from bs4 import BeautifulSoup
import re,json
article_url='https://apnews.com/article/lifestyle-travel-coronavirus-pandemic-health-education-418fe38201db53d2848e0138a28ff824'
user_agent='Mozilla/5.0 (Windows NT 6.2; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.90 Safari/537.36'
request_header={ 'User-Agent': user_agent}
source = requests.get(article_url, headers=request_header).text
data = json.loads(re.search(r"window\['titanium-state'\] = (.*)", source, re.M).group(1))
content = data['content']['data']
content = content[list(content.keys())[0]]
soup = BeautifulSoup(content['storyHTML'])
for p in soup.select('p'):
    print(p.text.strip())
Regex: the pattern window\['titanium-state'\] = (.*) captures the JavaScript state object embedded in the page source, which holds the story HTML.

My Python app doesn't work and gives None as an answer

Hi, I would like to know why my app is giving me this output. I've already tried everything I found on Google and still have no idea why it is happening.
import requests
from bs4 import BeautifulSoup
URL = 'https://www.amazon.co.uk/XLTOK-Charging-Transfer-Charger-Nintendo/dp/B0828RYQ7W/ref=sr_1_1_sspa?dchild=1&keywords=type+c&qid=1598485860&sr=8-1-spons&psc=1&spLa=ZW5jcnlwdGVkUXVhbGlmaWVyPUE3TDNSNUlITUNKTUMmZW5jcnlwdGVkSWQ9QTAwNDg4MTMyUlFQN0Y4RllGQzE2JmVuY3J5cHRlZEFkSWQ9QTAxNDk0NzMyMFNLSUdPU0taVUpRJndpZGdldE5hbWU9c3BfYXRmJmFjdGlvbj1jbGlja1JlZGlyZWN0JmRvTm90TG9nQ2xpY2s9dHJ1ZQ=='
headers = {"User-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.125 Safari/537.36'}
page = requests.get(URL, headers=headers)
soup = BeautifulSoup(page.content, 'html.parser')
title = soup.find(id="productTitle")
print(title)
When called:
C:\Proyecto nuevo>Python main.py
None
So if anyone would like to help me, that would be amazing!!
If you look at the code of the webpage you are trying to scrape, you will find that it is pretty much all JavaScript that populates the page when it loads. The requests library fetches this code but doesn't run it. Your search for the product title gets None because the fetched HTML doesn't contain it.
To scrape this page you will have to run the JavaScript on it. You can check out the Selenium WebDriver in Python to do this, as in the sketch below.
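A minimal sketch of that approach, assuming Selenium 4 with Chrome/chromedriver available locally (the product URL is trimmed and the fixed sleep is a crude stand-in for a proper wait):
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

URL = 'https://www.amazon.co.uk/dp/B0828RYQ7W'   # trimmed version of the URL from the question
driver = webdriver.Chrome()                      # assumes chromedriver is installed / on PATH
driver.get(URL)
time.sleep(2)                                    # crude wait for the JavaScript to finish
title = driver.find_element(By.ID, "productTitle")
print(title.text.strip())
driver.quit()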

How to avoid 403 problem using BeautifulSoup and headers?

I am using a combination of requests and BeautifulSoup to develop a web-scraping program in Python.
Unfortunately, I get a 403 error (even when using a header).
Here is my code:
from bs4 import BeautifulSoup
from requests import get
headers_m = ({'User-Agent':
'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'})
sapo_m = "https://www.idealista.it/vendita-case/milano-milano/"
response_m = get(sapo_m, headers=headers_m)
This is not a general Python question. The site blocks such straightforward scraping attempts; you need to find a set of headers (specific to this site) that will pass its validation.
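A rough sketch of that idea: copy the headers your own browser sends (visible in the network tab of the developer tools) into the request. The extra header values below are generic placeholders you would replace with your browser's, and they may still not be enough for this particular site:
from requests import get

headers_m = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Referer': 'https://www.idealista.it/',
}
sapo_m = "https://www.idealista.it/vendita-case/milano-milano/"
response_m = get(sapo_m, headers=headers_m)
print(response_m.status_code)   # check whether the 403 is gone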
Simply use Chrome as User-Agent.
import requests
from bs4 import BeautifulSoup

BeautifulSoup(requests.get("https://...", headers={"User-Agent": "Chrome"}).content, 'html.parser')

Python crawler wont remove links from queue

I'm just learning Python so forgive my poor coding.
I am just trying to create a website crawler which will eventually create a sitemap, report broken links, etc. But right at the beginning I am getting stuck: when creating a queue of links to crawl, I want each link to be removed from the queue list as it gets crawled. But for some reason, duplicate URLs end up in the crawled list as well as in the queue list. I am guessing there is something wrong with the loop, but I am not sure.
Any help will be greatly appreciated.
In my main file I simply call two functions in this order:
find_page_links(url)
crawler(url)
And my linkfinder file looks like this:
from urllib import parse
from urllib.parse import urljoin
from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError
from bs4 import BeautifulSoup
import urllib.request
import re
from urllib.parse import urlparse
from general import file_to_set, set_to_file
queue_file = 'savvy/queue.txt'
crawled_file = 'savvy/crawled.txt'
def find_page_links(page_url):
    crawled = file_to_set(crawled_file)
    queued = file_to_set(queue_file)
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    req = urllib.request.Request(page_url, headers=headers)
    response = urllib.request.urlopen(req)
    res = BeautifulSoup(response.read(), "html.parser")
    for link in res.find_all('a'):
        aLink = urljoin(page_url, link.get('href'))
        if page_url in aLink:
            queued.add(aLink)
    crawled.add(page_url)
    queued.remove(page_url)
    set_to_file(crawled, crawled_file)
    set_to_file(queued, queue_file)
    return queued

def crawler(base_url):
    crawled = file_to_set(crawled_file)
    queued = file_to_set(queue_file)
    for link in queued.copy():
        if link not in crawled:
            find_page_links2(base_url, link, queued, crawled)
        else:
            queued.remove(link)

def find_page_links2(base_url, page_url, queued, crawled):
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    req = urllib.request.Request(page_url, headers=headers)
    response = urllib.request.urlopen(req)
    res = BeautifulSoup(response.read(), "html.parser")
    for link in res.find_all('a'):
        aLink = urljoin(page_url, link.get('href'))
        if base_url in aLink:
            queued.add(aLink)
    crawled.add(page_url)
    queued.remove(page_url)
    set_to_file(crawled, crawled_file)
    set_to_file(queued, queue_file)
I think the reason you get duplicates is that you simply check the HTML for <a href> tags and place every link you find into your queue. The problem here is that there might be several <a> tags that link to the same page.
An easy fix might be to get all the links (just as you do now) and then remove duplicates before you start crawling.
Alternatively, you can check whether the link is already in the queue and only add new links, as in the sketch below.
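A rough sketch of that second idea, reusing the names from find_page_links2 above (res, base_url, page_url, queued, crawled all come from the question's own code); checking against both sets before adding keeps already-seen links out of the queue:
for link in res.find_all('a'):
    aLink = urljoin(page_url, link.get('href'))
    # only queue same-site links that have not been queued or crawled yet
    if base_url in aLink and aLink not in queued and aLink not in crawled:
        queued.add(aLink)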

How to do scraping from a page with BeautifulSoup

The question I'm asking is very simple, but I can't make it work and I don't know why!
I want to scrape the beer rating from this page https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone with BeautifulSoup, but it doesn't work.
This is my code:
import requests
import bs4
from bs4 import BeautifulSoup
url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'
test_html = requests.get(url).text
soup = BeautifulSoup(test_html, "lxml")
rating = soup.findAll("span", class_="ratingValue")
rating
When I run it, it doesn't work, but if I do the same thing on another page it works... I don't know why. Can someone help me? The rating I'm after is 4.58.
Thanks everybody!
If you print the test_html, you'll find you get a 403 forbidden response.
You should add a header (at least a user-agent : ) ) to your GET request.
import requests
from bs4 import BeautifulSoup
headers = {
'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.108 Safari/537.36'
}
url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'
test_html = requests.get(url, headers=headers).text
soup = BeautifulSoup(test_html, 'html5lib')
rating = soup.find('span', {'itemprop': 'ratingValue'})
print(rating.text)
# 4.58
You are getting a 403 Forbidden status code because the server refuses to fulfill your request even though it understood it. You will often see this when scraping popular websites that have security features to keep bots out, so you need to disguise your request.
For that you need to use headers.
Also, correct the tag attribute whose data you're trying to get, i.e. itemprop.
Use lxml as your tree builder, or any other of your choice.
import requests
from bs4 import BeautifulSoup
url = 'https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone'
# Add this
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36'}
test_html = requests.get(url, headers=headers).text
soup = BeautifulSoup(test_html, 'lxml')
rating = soup.find('span', {'itemprop':'ratingValue'})
print(rating.text)
The page you are requesting responds with 403 Forbidden, so you might not get an error, but the result will be an empty list []. To avoid this we add a user agent, and this code will get you the desired result.
import urllib.request
from bs4 import BeautifulSoup
user_agent = 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
url = "https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone"
headers={'User-Agent':user_agent}
request=urllib.request.Request(url,None,headers) #The assembled request
response = urllib.request.urlopen(request)
soup = BeautifulSoup(response, "lxml")
rating = soup.find('span', {'itemprop':'ratingValue'})
rating.text
You are facing this error because some websites can't be scraped with Beautiful Soup alone. For these kinds of websites you have to use Selenium:
download the latest chromedriver for your operating system
install Selenium with the command "pip install selenium"
# import required modules
import selenium
from selenium import webdriver
from bs4 import BeautifulSoup
import time, os
curren_dir = os.getcwd()
print(curren_dir)
# concatenate the web driver path with your current dir; on Windows change "/" to "\"
# make sure you placed chromedriver in the current directory
driver = webdriver.Chrome(curren_dir+'/chromedriver')
# driver.get opens the url in the browser
driver.get('https://www.brewersfriend.com/homebrew/recipe/view/16367/southern-tier-pumking-clone')
time.sleep(1)
# fetch the rendered html from the driver
super_html = driver.page_source
# parse the rendered html with 'html.parser'
soup = BeautifulSoup(super_html, "html.parser")
rating = soup.findAll("span",itemprop="ratingValue")
rating[0].text
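As a small follow-up to the answer above, it is worth calling driver.quit() once the rating has been read so the Chrome window and the chromedriver process are shut down.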
