I'm scraping a web page called CVE Trends:
import bs4, requests,webbrowser
LINK = "https://cvetrends.com/"
PRE_LINK = "https://nvd.nist.gov/"
response = requests.get(LINK)
response.raise_for_status()
soup=bs4.BeautifulSoup(response.text,'html.parser')
div_tweets=soup.find('div',class_='tweet_text')
a_tweets=div_tweets.find_all('a')
link_tweets =[]
for a_tweet in a_tweets:
    link_tweet = str(a_tweet.get('href'))
    if PRE_LINK in link_tweet:
        link_tweets.append(link_tweet)
from pprint import pprint
pprint(link_tweets)
This is the code that I've written so far. I've tried many approaches, but it always gives the same error:
'NoneType' object has no attribute 'find_all'
Can someone help me please? I really need this.
Thanks in advance for any answer.
This happens because the response doesn't contain the data you expect.
https://cvetrends.com/
The content of this website is loaded with JavaScript, so it is not in the HTML returned by requests.
Instead of scraping the page, you can get the data from https://cvetrends.com/api/cves/24hrs
Here is one solution:
import requests
import json
from urlextract import URLExtract
LINK = "https://cvetrends.com/api/cves/24hrs"
PRE_LINK = "https://nvd.nist.gov/"
link_tweets = []
# library for URL extraction
extractor = URLExtract()
# fetch the response from LINK (JSON response)
html = requests.get(LINK).text
# convert the string to a JSON object
twitt_json = json.loads(html)
twitt_datas = twitt_json.get('data')
for twitt_data in twitt_datas:
    # extract tweets
    twitts = twitt_data.get('tweets')
    for twitt in twitts:
        # extract the tweet text and check the condition
        twitt_text = twitt.get('tweet_text')
        if PRE_LINK in twitt_text:
            # find URLs in the text
            urls_list = extractor.find_urls(twitt_text)
            for url in urls_list:
                if PRE_LINK in url:
                    link_tweets.append(twitt_text)
print(link_tweets)
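If you only need the NVD links themselves rather than the full tweet texts, the same traversal can be sketched without the urlextract dependency. This is a minimal sketch, assuming the JSON has the shape used above (data, then tweets, then tweet_text). The regular expression is a simple stand-in for urlextract, the function name nvd_links is made up for illustration, and the sample payload is invented test data, not real API output.

```python
import re

PRE_LINK = "https://nvd.nist.gov/"
# Simple URL pattern used here in place of urlextract (an assumption, not the
# library's behaviour): it stops at whitespace, quotes and closing brackets.
URL_RE = re.compile(r"https?://[^\s\"'<>)\]]+")


def nvd_links(payload, prefix=PRE_LINK):
    """Return the unique URLs starting with `prefix`, in first-seen order."""
    links = []
    for item in payload.get("data") or []:
        for tweet in item.get("tweets") or []:
            text = tweet.get("tweet_text") or ""  # tolerate missing/None text
            for url in URL_RE.findall(text):
                url = url.rstrip(".,;:!?")  # drop trailing sentence punctuation
                if url.startswith(prefix) and url not in links:
                    links.append(url)
    return links


# Invented sample data in the assumed shape of the API response.
sample = {
    "data": [
        {"tweets": [
            {"tweet_text": "Details: https://nvd.nist.gov/vuln/detail/CVE-0000-0000."},
            {"tweet_text": "No link here"},
            {"tweet_text": None},
        ]},
        {"tweets": None},
    ]
}
print(nvd_links(sample))
```

With real data you would pass the parsed JSON (for example twitt_json from the code above) instead of sample.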
This is happening because soup.find("div", class_="tweet_text") is not finding anything, so it returns None. The site you're trying to scrape is populated using JavaScript, so when you send a GET request to it, this is what you get back:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<title>
CVE Trends - crowdsourced CVE intel
</title>
<meta content="Monitor real-time, crowdsourced intel about trending CVEs on Twitter." name="description"/>
<meta content="trending CVEs, CVE intel, CVE trends" name="keywords"/>
<meta content="CVE Trends - crowdsourced CVE intel" name="title" property="og:title">
<meta content="Simon Bell" name="author"/>
<meta content="website" property="og:type">
<meta content="https://cvetrends.com/images/cve-trends.png" name="image" property="og:image">
<meta content="https://cvetrends.com" property="og:url">
<meta content="Monitor real-time, crowdsourced intel about trending CVEs on Twitter." property="og:description"/>
<meta content="en_GB" property="og:locale"/>
<meta content="en_US" property="og:locale:alternative"/>
<meta content="CVE Trends" property="og:site_name"/>
<meta content="summary_large_image" name="twitter:card"/>
<meta content="#SimonByte" name="twitter:creator"/>
<meta content="CVE Trends - crowdsourced CVE intel" name="twitter:title"/>
<meta content="Monitor real-time, crowdsourced intel about trending CVEs on Twitter." name="twitter:description"/>
<meta content="https://cvetrends.com/images/cve-trends.png" name="twitter:image"/>
<link href="https://cvetrends.com/favicon.ico" id="favicon" rel="icon" sizes="32x32"/>
<link href="https://cvetrends.com/apple-touch-icon.png" id="apple-touch-icon" rel="apple-touch-icon"/>
<link href="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/5.1.0/css/bootstrap.min.css" rel="stylesheet"/>
</meta>
</meta>
</meta>
</meta>
</head>
<body>
<div id="root">
</div>
<noscript>
Please enable JavaScript to run this app.
</noscript>
<script src="https://cvetrends.com/js/main.d0aa7136854f54748577.bundle.js">
</script>
</body>
</html>
You can verify this using print(soup.prettify()).
To be able to scrape this site you'll probably have to use something like Selenium.
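A rough way to spot this situation in code, before searching for elements that will never be there, is to check whether the body of the downloaded HTML contains any visible text at all. The following is only a heuristic sketch using the standard library's html.parser; the names BodyTextCollector and looks_js_rendered are made up for illustration, and a page can of course be partly server-rendered and still load the part you need with JavaScript.

```python
from html.parser import HTMLParser


class BodyTextCollector(HTMLParser):
    """Collects visible text inside <body>, skipping script/style/noscript."""

    SKIP = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.in_body and not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def looks_js_rendered(html):
    """Heuristic: True if <body> has no visible text outside script/noscript."""
    parser = BodyTextCollector()
    parser.feed(html)
    parser.close()
    return not parser.chunks


# Shortened version of the shell shown above.
shell = (
    '<html><head><title>CVE Trends</title></head><body>'
    '<div id="root"></div>'
    '<noscript>Please enable JavaScript to run this app.</noscript>'
    '<script src="main.js"></script></body></html>'
)
print(looks_js_rendered(shell))  # True
```

If this returns True for response.text, the data has to come from somewhere else, such as a JSON endpoint the page calls or a real browser.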
I am trying to scrape a table from a website:
After importing the url
print(soup.prettify())
<!DOCTYPE html>
<html lang="en">
<head>
<meta content="noindex" name="robots"/>
<meta charset="utf-8"/>
<meta content="width=device-width,initial-scale=1,shrink-to-fit=no" name="viewport"/>
<link href="https://d9mzsvqupf0ma.cloudfront.net/0367505b9e/static/react/favicon.ico" rel="shortcut icon"/>
<title>
Reonomy
</title>
<script src="/static/react/env.js?1592498512097">
</script>
<script onerror='console.error("Error loading Google Maps. Please check your firewall, proxy, or ad blocker settings.")' src="//maps.googleapis.com/maps/api/js?v=3&libraries=places,drawing,geometry&client=gme-scryerinc">
</script>
<script type="text/javascript">
!function(){if(void 0!==window.env&&"production"===window.env.REACT_APP_ENVIRONMENT){var i=window.analytics=window.analytics||[];if(!i.initialize)if(i.invoked)window.console&&console.error&&console.error("Segment snippet included twice.");else{i.invoked=!0,i.methods=["trackSubmit","trackClick","trackLink","trackForm","pageview","identify","reset","group","track","ready","alias","debug","page","once","off","on"],i.factory=function(t){return function(){var e=Array.prototype.slice.call(arguments);return e.unshift(t),i.push(e),i}};for(var e=0;e<i.methods.length;e++){var t=i.methods[e];i[t]=i.factory(t)}i.load=function(e,t){var n=document.createElement("script");n.type="text/javascript",n.async=!0,n.src="https://cdn.segment.com/analytics.js/v1/"+e+"/analytics.min.js";var o=document.getElementsByTagName("script")[0];o.parentNode.insertBefore(n,o),i._loadOptions=t},i.SNIPPET_VERSION="4.1.0",i.load("Jb0xYxcgY3BJTcGWoAmtUP9qwhM9V2pp")}}}()
</script>
<link href="https://d9mzsvqupf0ma.cloudfront.net/0367505b9e/static/react/static/css/main.4f4bf592.chunk.css" rel="stylesheet"/>
</head>
<body>
<noscript>
You need to enable JavaScript to run this app.
</noscript>
<div id="root">
</div>
<script>
!function(d){function e(e){for(var t,r,n=e[0],c=e[1],o=e[2],a=0,f=[];a<n.length;a++)r=n[a],Object.prototype.hasOwnProperty.call(s,r)&&s[r]&&f.push(s[r][0]),s[r]=0;for(t in c)Object.prototype.hasOwnProperty.call(c,t)&&(d[t]=c[t]);for(h&&h(e);f.length;)f.shift()();return i.push.apply(i,o||[]),u()}function u(){for(var e,t=0;t<i.length;t++){for(var r=i[t],n=!0,c=1;c<r.length;c++){var o=r[c];0!==s[o]&&(n=!1)}n&&(i.splice(t--,1),e=p(p.s=r[0]))}return e}var r={},l={5:0},s={5:0},i=[];function p(e){if(r[e])return r[e].exports;var t=r[e]={i:e,l:!1,exports:{}};return d[e].call(t.exports,t,t.exports,p),t.l=!0,t.exports}p.e=function(i){var e=[];l[i]?e.push(l[i]):0!==l[i]&&{20:1,21:1,24:1,25:1}[i]&&e.push(l[i]=new Promise(function(e,n){for(var t="static/css/"+({}[i]||i)+"."+{0:"31d6cfe0",1:"31d6cfe0",2:"31d6cfe0",3:"31d6cfe0",7:"31d6cfe0",8:"31d6cfe0",9:"31d6cfe0",10:"31d6cfe0",11:"31d6cfe0",12:"31d6cfe0",13:"31d6cfe0",14:"31d6cfe0",15:"31d6cfe0",16:"31d6cfe0",17:"31d6cfe0",18:"31d6cfe0",19:"31d6cfe0",20:"7bbd82a1",21:"989321a7",22:"31d6cfe0",23:"31d6cfe0",24:"d608a43c",25:"36cb7054",26:"31d6cfe0",27:"31d6cfe0",28:"31d6cfe0",29:"31d6cfe0",30:"31d6cfe0",31:"31d6cfe0",32:"31d6cfe0"}[i]+".chunk.css",c=p.p+t,r=document.getElementsByTagName("link"),o=0;o<r.length;o++){var a=(d=r[o]).getAttribute("data-href")||d.getAttribute("href");if("stylesheet"===d.rel&&(a===t||a===c))return e()}var f=document.getElementsByTagName("style");for(o=0;o<f.length;o++){var d;if((a=(d=f[o]).getAttribute("data-href"))===t||a===c)return e()}var u=document.createElement("link");u.rel="stylesheet",u.type="text/css",u.onload=e,u.onerror=function(e){var t=e&&e.target&&e.target.src||c,r=new Error("Loading CSS chunk "+i+" failed.\n("+t+")");r.code="CSS_CHUNK_LOAD_FAILED",r.request=t,delete l[i],u.parentNode.removeChild(u),n(r)},u.href=c,document.getElementsByTagName("head")[0].appendChild(u)}).then(function(){l[i]=0}));var r=s[i];if(0!==r)if(r)e.push(r[2]);else{var t=new 
Promise(function(e,t){r=s[i]=[e,t]});e.push(r[2]=t);var n,c=document.createElement("script");c.charset="utf-8",c.timeout=120,p.nc&&c.setAttribute("nonce",p.nc),c.src=p.p+"static/js/"+({}[i]||i)+"."+{0:"ca0cfe7f",1:"1f775947",2:"f3aa526c",3:"8e92118a",7:"8821eefa",8:"e17401b1",9:"6e4ba317",10:"24f1a107",11:"96c5e7b8",12:"7a6ef661",13:"e539811a",14:"37c1ffc4",15:"dc8d4356",16:"2d61de04",17:"23eefbbb",18:"51a9cf50",19:"7f8a5cf4",20:"c409a0e9",21:"00e0dc95",22:"de275a36",23:"114fe889",24:"a1c29240",25:"b1426e77",26:"2eaf037b",27:"cf150351",28:"ac391d82",29:"b2c0bc67",30:"4b510904",31:"5a5b63b1",32:"f8a3d31f"}[i]+".chunk.js";var o=new Error;n=function(e){c.onerror=c.onload=null,clearTimeout(a);var t=s[i];if(0!==t){if(t){var r=e&&("load"===e.type?"missing":e.type),n=e&&e.target&&e.target.src;o.message="Loading chunk "+i+" failed.\n("+r+": "+n+")",o.name="ChunkLoadError",o.type=r,o.request=n,t[1](o)}s[i]=void 0}};var a=setTimeout(function(){n({type:"timeout",target:c})},12e4);c.onerror=c.onload=n,document.head.appendChild(c)}return Promise.all(e)},p.m=d,p.c=r,p.d=function(e,t,r){p.o(e,t)||Object.defineProperty(e,t,{enumerable:!0,get:r})},p.r=function(e){"undefined"!=typeof Symbol&&Symbol.toStringTag&&Object.defineProperty(e,Symbol.toStringTag,{value:"Module"}),Object.defineProperty(e,"__esModule",{value:!0})},p.t=function(t,e){if(1&e&&(t=p(t)),8&e)return t;if(4&e&&"object"==typeof t&&t&&t.__esModule)return t;var r=Object.create(null);if(p.r(r),Object.defineProperty(r,"default",{enumerable:!0,value:t}),2&e&&"string"!=typeof t)for(var n in t)p.d(r,n,function(e){return t[e]}.bind(null,n));return r},p.n=function(e){var t=e&&e.__esModule?function(){return e.default}:function(){return e};return p.d(t,"a",t),t},p.o=function(e,t){return Object.prototype.hasOwnProperty.call(e,t)},p.p="https://d9mzsvqupf0ma.cloudfront.net/0367505b9e/static/react/",p.oe=function(e){throw console.error(e),e};var 
t=this.webpackJsonpfrontend=this.webpackJsonpfrontend||[],n=t.push.bind(t);t.push=e,t=t.slice();for(var c=0;c<t.length;c++)e(t[c]);var h=n;u()}([])
</script>
<script src="https://d9mzsvqupf0ma.cloudfront.net/0367505b9e/static/react/static/js/6.41e506b7.chunk.js">
</script>
<script src="https://d9mzsvqupf0ma.cloudfront.net/0367505b9e/static/react/static/js/main.e68cecb8.chunk.js">
</script>
</body>
</html>
When I inspect the website, I see that my table is there between tags:
Still, when I use:
print(soup.find_all('td'))
It returns an empty list. Can someone point out what I did wrong?
BeautifulSoup doesn't evaluate JavaScript.
It looks like all those tables are being generated by JavaScript. You could use dryscrape to evaluate the page before passing it on to BeautifulSoup.
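One way to confirm this without a browser is to count the tags in the raw response: if the server-sent HTML contains no td tags at all, no parser will find them. This is a small sketch using only the standard library; TagCounter and count_tag are illustrative names, and the shell string is a shortened version of the output in the question.

```python
from html.parser import HTMLParser


class TagCounter(HTMLParser):
    """Counts how many times each start tag occurs in the document."""

    def __init__(self):
        super().__init__()
        self.counts = {}

    def handle_starttag(self, tag, attrs):
        self.counts[tag] = self.counts.get(tag, 0) + 1


def count_tag(html, tag):
    """Return the number of `tag` start tags in the raw HTML string."""
    counter = TagCounter()
    counter.feed(html)
    counter.close()
    return counter.counts.get(tag.lower(), 0)


# Shortened version of the HTML returned for the page in the question.
shell = (
    '<body><noscript>You need to enable JavaScript to run this app.</noscript>'
    '<div id="root"></div><script src="main.js"></script></body>'
)
print(count_tag(shell, "td"))      # 0
print(count_tag(shell, "script"))  # 1
```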
I was planning to create a basic web scraper for the site Sneakersnstuff.com, but my efforts were stopped early by an error. When I request the URL https://www.sneakersnstuff.com/, rather than getting the HTML of the website, or even the entrance captcha, I am redirected to a Cloudflare page with the error message "enable cookies". Both my code and the response are shown below.
import requests
import cfscrape
session = requests.session()
response = session.get('https://www.sneakersnstuff.com/')
print(response.headers)
<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js ie6 oldie" lang="en-US"> <![endif]-->
<!--[if IE 7]> <html class="no-js ie7 oldie" lang="en-US"> <![endif]-->
<!--[if IE 8]> <html class="no-js ie8 oldie" lang="en-US"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" lang="en-US">
<!--<![endif]-->
<head>
<title>Access denied | www.sneakersnstuff.com used Cloudflare to restrict access</title>
<meta charset="UTF-8" />
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<meta http-equiv="X-UA-Compatible" content="IE=Edge,chrome=1" />
<meta name="robots" content="noindex, nofollow" />
<meta name="viewport" content="width=device-width,initial-scale=1,maximum-scale=1" />
<link rel="stylesheet" id="cf_styles-css" href="/cdn-cgi/styles/cf.errors.css" type="text/css"
media="screen,projection" />
<!--[if lt IE 9]><link rel="stylesheet" id='cf_styles-ie-css' href="/cdn-cgi/styles/cf.errors.ie.css" type="text/css" media="screen,projection" /><![endif]-->
<style type="text/css">
body {
margin: 0;
padding: 0
}
</style>
<!--[if gte IE 10]><!-->
<script type="text/javascript" src="/cdn-cgi/scripts/zepto.min.js"></script>
<!--<![endif]-->
<!--[if gte IE 10]><!-->
<script type="text/javascript" src="/cdn-cgi/scripts/cf.common.js"></script>
<!--<![endif]-->
</head>
<body>
<div id="cf-wrapper">
<div class="cf-alert cf-alert-error cf-cookie-error" id="cookie-alert" data-translate="enable_cookies">Please
enable cookies.</div>
<div id="cf-error-details" class="cf-error-details-wrapper">
<div class="cf-wrapper cf-header cf-error-overview">
<h1>
<span class="cf-error-type" data-translate="error">Error</span>
<span class="cf-error-code">1020</span>
<small class="heading-ray-id">Ray ID: 578133293d83e0d6 • 2020-03-22 16:13:25 UTC</small>
</h1>
<h2 class="cf-subheadline">Access denied</h2>
</div><!-- /.header -->
<section></section><!-- spacer -->
<div class="cf-section cf-wrapper">
<div class="cf-columns two">
<div class="cf-column">
<h2 data-translate="what_happened">What happened?</h2>
<p>This website is using a security service to protect itself from online attacks.</p>
</div>
</div>
</div><!-- /.section -->
<div class="cf-error-footer cf-wrapper">
<p>
<span class="cf-footer-item">Cloudflare Ray ID: <strong>578133293d83e0d6</strong></span>
<span class="cf-footer-separator">•</span>
<span class="cf-footer-item"><span>Your IP</span>: 96.241.108.243</span>
<span class="cf-footer-separator">•</span>
<span class="cf-footer-item"><span>Performance & security by</span> <a
href="https://www.cloudflare.com/5xx-error-landing?utm_source=error_footer" id="brand_link"
target="_blank">Cloudflare</a></span>
</p>
</div><!-- /.error-footer -->
</div><!-- /#cf-error-details -->
</div><!-- /#cf-wrapper -->
<script type="text/javascript">
window._cf_translation = {};
</script>
</body>
</html>
I have attempted to use cfscrape, a library recommended by many, to no avail.
Adding Browser/User-Agent Filtering to cloudscraper did the trick for me.
import cloudscraper
from bs4 import BeautifulSoup
# Adding Browser / User-Agent Filtering should help ie.
# will give you only desktop firefox User-Agents on Windows
scraper = cloudscraper.create_scraper(browser={'browser': 'firefox','platform': 'windows','mobile': False})
html = scraper.get("https://www.sneakersnstuff.com/").content
soup = BeautifulSoup(html, 'html.parser')
print(soup)
import cloudscraper
from bs4 import BeautifulSoup
scraper = cloudscraper.create_scraper()
html = scraper.get("https://www.sneakersnstuff.com/").content
soup = BeautifulSoup(html, 'html.parser')
print(soup)
Output:
cloudscraper.exceptions.CloudflareReCaptchaProvider: Cloudflare reCaptcha detected, unfortunately you haven't loaded an anti reCaptcha provider correctly via the 'recaptcha' parameter.
Next Step ?
3rd Party reCaptcha Solvers
Description
cloudscraper currently supports the following 3rd party reCaptcha solvers, should you require them.
anticaptcha
deathbycaptcha
2captcha
9kw
return_response
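Whichever client you try, it can help to check programmatically whether the response is the real page or a Cloudflare error page like the one in the question. The sketch below is only an illustration: it looks for markers copied from that error page (the cf-wrapper id, the cf-error-code span and the Ray ID), which Cloudflare may change at any time, and cloudflare_block_info is a made-up helper name.

```python
import re

# Patterns based on the markup of the error page shown in the question.
ERROR_CODE_RE = re.compile(r'class="cf-error-code">\s*(\d+)\s*<')
RAY_ID_RE = re.compile(r"Ray ID:\s*(?:<strong>)?([0-9a-f]+)")


def cloudflare_block_info(html):
    """Return error code and Ray ID if `html` looks like a Cloudflare error page.

    Returns None when the cf-wrapper marker is absent.
    """
    if 'id="cf-wrapper"' not in html:
        return None
    code = ERROR_CODE_RE.search(html)
    ray = RAY_ID_RE.search(html)
    return {
        "code": int(code.group(1)) if code else None,
        "ray_id": ray.group(1) if ray else None,
    }


# Fragment of the response from the question.
page = (
    '<div id="cf-wrapper"><h1>'
    '<span class="cf-error-code">1020</span>'
    '<small class="heading-ray-id">Ray ID: 578133293d83e0d6</small>'
    '</h1></div>'
)
print(cloudflare_block_info(page))
```

For the page in the question this reports code 1020, the "Access denied" error shown there, which points to the site blocking the request rather than to a parsing problem in your code.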
I'm trying to extract a PDF from a site that opens it in Google Chrome's native PDF viewer; its content type is application/pdf. The issue is that the URLs I get aren't links to the PDF itself but to a .zul page where JavaScript loads or fetches the PDF.
Here's my download code below:
from selenium import webdriver

def download_pdf(url, idx, save_dir):
    options = webdriver.ChromeOptions()
    # Try to disable the built-in PDF viewer and set the download directory
    profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}],
               "download.default_directory": save_dir}
    options.add_experimental_option("prefs", profile)
    driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver", chrome_options=options)
    driver.get(url)
The problem I'm encountering with the above code is that I get the following output from driver.page_source:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Pragma" content="no-cache" />
<meta http-equiv="Expires" content="-1" />
<title>Document Viewer</title>
<link rel="stylesheet" type="text/css" href="/eSMARTContracts/zkau/web/9776a7f0/zul/css/zk.wcs;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1"/>
<script type="text/javascript" src="/eSMARTContracts/zkau/web/9776a7f0/js/zk.wpd;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1" charset="UTF-8">
</script>
<script type="text/javascript" src="/eSMARTContracts/zkau/web/9776a7f0/js/zul.lang.wpd;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1" charset="UTF-8">
</script>
<!-- ZK 6.0.2 EE 2012072410 -->
</head>
<body>
<div id="j4AP_" class="z-temp"></div>
<script class="z-runonce" type="text/javascript">zk.pi=1;zkmx(
[0,'j4AP_',{dt:'z_2m1',cu:'/eSMARTContracts;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1',uu:'/eSMARTContracts/zkau;jsessionid=088DC94ECA6804AF717A0E997E4F1444.node1',ru:'/service/dpsweb/ViewDPSWeb.zul'},[
['zul.wnd.Window','j4AP0',{$$onSize:false,$$onMaximize:false,$$onOpen:false,$$onMinimize:false,$$onZIndex:false,$onClose:true,$$onMove:false,width:'100%',height:'100%',prolog:'\
'},[]]]]);
</script>
<noscript>
<div class="noscript"><p>Sorry, JavaScript must be enabled.<br/>Change your browser options, then try again.</p></div>
</noscript>
</body>
</html>
EDIT: Included the link
I want to copy all the code of a URL (http://modelseed.org/biochem/reactions/rxn00001) using Python 3.6, but I can only copy part of the code, and I don't know why.
So far, I tried with "requests" module
import requests
page = requests.get("http://modelseed.org/biochem/reactions/rxn00001")
print(page.content)
and "urllib"
import urllib.request
site = urllib.request.urlopen("http://modelseed.org/biochem/reactions/rxn00001")
print(site.read())
The part of the code with the "Reaction Details" information, such as "Name", "ID" and "Abbreviation", is missing, but it is visible if I inspect the page in Chrome's developer tools.
The code I'm able to download using the two snippets above is:
<!DOCTYPE html>
<html lang="en" ng-app="ModelSEED">
<head>
<base href="/"/>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<meta content="IE=edge" http-equiv="X-UA-Compatible"/>
<meta content="initial-scale=1, maximum-scale=1, user-scalable=no" name="viewport">
<meta content="The ModelSEED is a resource for the reconstruction, exploration, comparison, and analysis of metabolic models." name="description"/>
<link href="/img/ModelSEED-favicon.png?v=2.0" rel="shortcut icon"/>
<meta content="nconrad" name="author"/>
<title>
ModelSEED
</title>
<link href="components/angular-material/angular-material.css" rel="stylesheet"/>
<link href="components/bootstrap/dist/css/bootstrap.min.css" rel="stylesheet"/>
<!-- to be removed -->
<link href="components/font-awesome/css/font-awesome.min.css" rel="stylesheet"/>
<link href="icomoon/style.css" rel="stylesheet"/>
<link href="https://fonts.googleapis.com/icon?family=Material+Icons" rel="stylesheet"/>
<link href="http://fonts.googleapis.com/css?family=Montserrat:400,700" rel="stylesheet" type="text/css"/>
<link href="build/style.css" rel="stylesheet"/>
<!--<script src="https://cdn.socket.io/socket.io-1.3.7.js"></script>-->
<script src="build/site.js">
</script>
<!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
</meta>
</head>
<body>
<div style="height: 100%;" ui-view="">
</div>
<script>
(function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
(i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
})(window,document,'script','//www.google-analytics.com/analytics.js','ga');
ga('create', 'UA-67412611-1', 'auto');
ga('send', 'pageview');
</script>
</body>
</html>
Does anyone have a hint as to why the code between <div style="height: 100%;" ui-view=""> and </div> (just after <body> and before <script>) is not downloaded?
Thank you.
It's being inserted by JavaScript, so neither requests nor urllib will find it. You need a browser for this; try Selenium or PhantomJS.
Something like:
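from selenium import webdriver

url = "http://modelseed.org/biochem/reactions/rxn00001"
# path to your chromedriver binary (positional form as used in older Selenium versions)
driver = webdriver.Chrome('./chromedriver')
driver.get(url)
html = driver.page_source  # HTML after the browser has run the page's JavaScript
driver.quit()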
Try getting this url instead: https://www.patricbrc.org/api/model_reaction/?http_accept=application/json&eq(id,rxn00001)
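A minimal sketch of that approach with only the standard library is shown below. The URL pattern is copied from the link above; the helper names reaction_url and fetch_reaction are made up for illustration, and the structure of the returned JSON is not documented here, so inspect it before relying on specific keys.

```python
import json
import urllib.request

API_BASE = "https://www.patricbrc.org/api/model_reaction/"


def reaction_url(reaction_id):
    """Build the query URL for one reaction ID, following the link above."""
    return f"{API_BASE}?http_accept=application/json&eq(id,{reaction_id})"


def fetch_reaction(reaction_id, timeout=10):
    """Download and decode the JSON for one reaction (performs a network request)."""
    with urllib.request.urlopen(reaction_url(reaction_id), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


print(reaction_url("rxn00001"))
```

Calling fetch_reaction("rxn00001") then returns the decoded JSON, assuming the endpoint is still available and still responds as it did when this answer was written.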