Python Craigslist scraping shows empty list - python

Hi, I was using the following code to scrape Craigslist.
import pandas as pd
import requests
%pylab inline
url_base = 'http://houston.craigslist.org/search/apa'
params = dict(bedrooms=2)
rsp = requests.get(url_base, params=params)
print(rsp.text[:500])
from bs4 import BeautifulSoup as bs4
html = bs4(rsp.text, 'html.parser')
print(html.prettify()[:1000])
Everything works fine up to this point, and the output is:
<!DOCTYPE html>
<html class="no-js">
<head>
<title>
houston apartments / housing rentals - craigslist
</title>
<meta content="houston apartments / housing rentals - craigslist"
name="description">
<meta content="IE=Edge" http-equiv="X-UA-Compatible"/>
<link href="https://houston.craigslist.org/search/apa" rel="canonical">
<link href="https://houston.craigslist.org/search/apa?
format=rss&min_bedrooms=2" rel="alternate" title="RSS feed for
craigslist | houston apartments / housing rentals - craigslist "
type="application/rss+xml">
<link href="https://houston.craigslist.org/search/apa?
s=120&min_bedrooms=2" rel="next">
<meta content="width=device-width,initial-scale=1" name="viewport">
<link href="//www.craigslist.org/styles/cl.css?
v=a14d0c65f7978c2bbc0d780a3ea7b7be" media="all" rel="stylesheet"
type="text/css">
<link href="//www.craigslist.org/styles/search.css?v=27e1d4246df60da5ffd1146d59a8107e" media="all" rel="stylesheet" type="
It clearly shows that the listing page is not empty and there are items I can use. Then I use the following code:
apts = html.find_all('p', attrs={'class': 'row'})
print(len(apts))
The output of print(len(apts)) is 0.
Can anyone please help me correct this code? I believe the Craigslist HTML has changed, but I don't know how to adapt the code here.
Thanks

There is no <p> tag with the 'row' class; instead, the <p> tags have the 'result-info' class.
import requests
url_base = 'http://houston.craigslist.org/search/apa'
params = dict(bedrooms=2)
rsp = requests.get(url_base, params=params)
print(rsp.text[:500])
from bs4 import BeautifulSoup as bs4
html = bs4(rsp.text, 'html.parser')
print(html.prettify()[:1000])
apts = html.find_all('p', attrs={'class': 'result-info'})
print(len(apts))
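As a quick offline check of the corrected selector, here is a sketch against a minimal snippet; the result-title and result-price class names are assumptions based on Craigslist's markup of that era, not guaranteed current:

```python
from bs4 import BeautifulSoup

# Minimal snippet mimicking the (assumed) listing markup
snippet = """
<p class="result-info">
  <a class="result-title" href="/apa/1.html">2BR near downtown</a>
  <span class="result-price">$1,200</span>
</p>
<p class="result-info">
  <a class="result-title" href="/apa/2.html">Spacious 2BR</a>
  <span class="result-price">$1,450</span>
</p>
"""

html = BeautifulSoup(snippet, 'html.parser')
apts = html.find_all('p', attrs={'class': 'result-info'})
listings = [(apt.find('a', attrs={'class': 'result-title'}).get_text(strip=True),
             apt.find('span', attrs={'class': 'result-price'}).get_text(strip=True))
            for apt in apts]
print(len(apts))  # 2
print(listings)
```

The same loop over the live page would then pull out each listing's title and price instead of just counting matches.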


ERROR: 'NoneType' object has no attribute 'find_all'

I'm doing web scraping of a web page called: CVE Trends
import bs4, requests, webbrowser
LINK = "https://cvetrends.com/"
PRE_LINK = "https://nvd.nist.gov/"
response = requests.get(LINK)
response.raise_for_status()
soup = bs4.BeautifulSoup(response.text, 'html.parser')
div_tweets = soup.find('div', class_='tweet_text')
a_tweets = div_tweets.find_all('a')
link_tweets = []
for a_tweet in a_tweets:
    link_tweet = str(a_tweet.get('href'))
    if PRE_LINK in link_tweet:
        link_tweets.append(link_tweet)
from pprint import pprint
pprint(link_tweets)
This is the code that I've written so far. I've tried many approaches, but it always gives the same error:
'NoneType' object has no attribute 'find_all'
Can someone help me please? I really need this.
Thanks in advance for any answer.
This is because the response does not contain the data you expect.
https://cvetrends.com/ loads its content with JavaScript, so you will not get the data from a plain request.
Instead of scraping the website, you can get the data from https://cvetrends.com/api/cves/24hrs
Here is one solution:
import requests
import json
from urlextract import URLExtract

LINK = "https://cvetrends.com/api/cves/24hrs"
PRE_LINK = "https://nvd.nist.gov/"
link_tweets = []
# library for URL extraction
extractor = URLExtract()
# fetch response from LINK (JSON response)
html = requests.get(LINK).text
# convert string to json object
twitt_json = json.loads(html)
twitt_datas = twitt_json.get('data')
for twitt_data in twitt_datas:
    # extract tweets
    twitts = twitt_data.get('tweets')
    for twitt in twitts:
        # extract tweet texts and validate condition
        twitt_text = twitt.get('tweet_text')
        if PRE_LINK in twitt_text:
            # find urls in the text
            urls_list = extractor.find_urls(twitt_text)
            for url in urls_list:
                if PRE_LINK in url:
                    link_tweets.append(twitt_text)
print(link_tweets)
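If you'd rather avoid the third-party urlextract dependency, the same filtering can be sketched with only the standard library. The sample payload below is a hypothetical stand-in for the API's data/tweets/tweet_text shape assumed above:

```python
import json
import re

# Hypothetical sample of the API's (assumed) response shape, parsed offline
sample = json.loads('''{"data": [{"tweets": [
  {"tweet_text": "CVE-2021-1234 details: https://nvd.nist.gov/vuln/detail/CVE-2021-1234"},
  {"tweet_text": "unrelated https://example.com/post"}]}]}''')

PRE_LINK = "https://nvd.nist.gov/"
url_re = re.compile(r'https?://\S+')  # crude URL matcher instead of urlextract

link_tweets = []
for entry in sample['data']:
    for tweet in entry['tweets']:
        text = tweet['tweet_text']
        # keep the tweet if any URL in it points at nvd.nist.gov
        if any(u.startswith(PRE_LINK) for u in url_re.findall(text)):
            link_tweets.append(text)
print(link_tweets)
```

The crude regex is good enough for filtering by prefix; urlextract is stricter about what counts as a URL.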
This is happening because soup.find("div", class_="tweet_text") is not finding anything, so it returns None. The site you're trying to scrape is populated using JavaScript, so when you send a GET request to it, this is what you're getting back:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8"/>
<meta content="width=device-width, initial-scale=1, shrink-to-fit=no" name="viewport"/>
<title>
CVE Trends - crowdsourced CVE intel
</title>
<meta content="Monitor real-time, crowdsourced intel about trending CVEs on Twitter." name="description"/>
<meta content="trending CVEs, CVE intel, CVE trends" name="keywords"/>
<meta content="CVE Trends - crowdsourced CVE intel" name="title" property="og:title">
<meta content="Simon Bell" name="author"/>
<meta content="website" property="og:type">
<meta content="https://cvetrends.com/images/cve-trends.png" name="image" property="og:image">
<meta content="https://cvetrends.com" property="og:url">
<meta content="Monitor real-time, crowdsourced intel about trending CVEs on Twitter." property="og:description"/>
<meta content="en_GB" property="og:locale"/>
<meta content="en_US" property="og:locale:alternative"/>
<meta content="CVE Trends" property="og:site_name"/>
<meta content="summary_large_image" name="twitter:card"/>
<meta content="#SimonByte" name="twitter:creator"/>
<meta content="CVE Trends - crowdsourced CVE intel" name="twitter:title"/>
<meta content="Monitor real-time, crowdsourced intel about trending CVEs on Twitter." name="twitter:description"/>
<meta content="https://cvetrends.com/images/cve-trends.png" name="twitter:image"/>
<link href="https://cvetrends.com/favicon.ico" id="favicon" rel="icon" sizes="32x32"/>
<link href="https://cvetrends.com/apple-touch-icon.png" id="apple-touch-icon" rel="apple-touch-icon"/>
<link href="https://cdnjs.cloudflare.com/ajax/libs/twitter-bootstrap/5.1.0/css/bootstrap.min.css" rel="stylesheet"/>
</meta>
</meta>
</meta>
</meta>
</head>
<body>
<div id="root">
</div>
<noscript>
Please enable JavaScript to run this app.
</noscript>
<script src="https://cvetrends.com/js/main.d0aa7136854f54748577.bundle.js">
</script>
</body>
</html>
You can verify this using print(soup.prettify()).
To be able to scrape this site you'll probably have to use something like Selenium.
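The error itself, and the guard against it, can be reproduced offline. A minimal sketch using an empty shell page like the one shown above:

```python
from bs4 import BeautifulSoup

# The JS-rendered page ships an almost empty shell, so the target div is absent
shell = '<html><body><div id="root"></div></body></html>'
soup = BeautifulSoup(shell, 'html.parser')

div_tweets = soup.find('div', class_='tweet_text')
print(div_tweets)  # None -> calling .find_all() on it raises AttributeError

# Guard before chaining calls, so an empty page yields an empty list instead
a_tweets = div_tweets.find_all('a') if div_tweets is not None else []
print(a_tweets)    # []
```

The guard doesn't fetch the missing data, of course; it only turns the crash into an empty result so you can detect the JS-rendered case cleanly.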

Change HTML text and saveback to HTML

I am working on a simple way to wrap each sentence of an ebook formatted in HTML in span tags.
I am using a trained machine learning model to classify end of sentence punctuation (".!?" ...) and get the real sentences boundaries (ex: in U.S.A, "S" is not considered a sentence).
The problem is, in order to feed my model correct data, I need to first extract the text out of my HTML ebook (using BeautifulSoup's get_text('\n')).
Right now, I am able to wrap the output of get_text('\n') in span tags. But I can't just save that, since I lose all the other tags used in the original HTML ebook.
Example HTML ebook sample:
<html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /><link href="style.css" rel="stylesheet" type="text/css" /><title> Name. Of the book. </title></head> ...
</div>
After get_text
Name. Of the book.
After running my algorithm:
<span>Name. Of the book.</span>
How can I get this output instead:
<html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /><link href="style.css" rel="stylesheet" type="text/css" /><title> <span>Name. Of the book.</span> </title></head> ...
</div>
Thank you in advance for your help!
You can use the wrap() method (doc) to wrap the text in <span> tags - it updates the whole HTML structure.
Example:
data = '''<html><head><meta http-equiv="Content-Type" content="text/html;charset=utf-8" /><link href="style.css" rel="stylesheet" type="text/css" /><title> Name. Of the book. </title></head>'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'html.parser')
print('Before:')
print('-' * 80)
print(soup.prettify())
print('-' * 80)
for text in soup.find_all(text=True):
    text.wrap(soup.new_tag("span"))  # use wrap() to wrap the text in a <span> tag
print('After:')
print('-' * 80)
print(soup.prettify())
print('-' * 80)
Prints (notice the <span> inside the <title> tag):
Before:
--------------------------------------------------------------------------------
<html>
<head>
<meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
<link href="style.css" rel="stylesheet" type="text/css"/>
<title>
Name. Of the book.
</title>
</head>
</html>
--------------------------------------------------------------------------------
After:
--------------------------------------------------------------------------------
<html>
<head>
<meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
<link href="style.css" rel="stylesheet" type="text/css"/>
<title>
<span>
Name. Of the book.
</span>
</title>
</head>
</html>
--------------------------------------------------------------------------------
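For the original goal of one <span> per sentence, the same idea can be combined with replace_with(), which accepts multiple replacement nodes in bs4 >= 4.10. The regex splitter below is only a naive stand-in for the trained sentence-boundary model:

```python
import re
from bs4 import BeautifulSoup

data = '<html><head><title>Name. Of the book.</title></head></html>'
soup = BeautifulSoup(data, 'html.parser')

def split_sentences(text):
    # Naive splitter: a placeholder for the ML sentence-boundary model
    return [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]

for node in soup.find_all(string=True):
    spans = []
    for sentence in split_sentences(str(node)):
        span = soup.new_tag('span')
        span.string = sentence
        spans.append(span)
    if spans:
        node.replace_with(*spans)  # multiple-argument form needs bs4 >= 4.10

print(soup)
```

Each text node is replaced in place, so all surrounding tags from the original ebook survive; only the text itself gets split into sentence-level spans.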
Okay, so I have a pretty naive but quite effective approach. You can get the entire HTML code first, store it in a string, and then use a regular expression on it to extract the text of the span tags.
This is the only way I can think of right now. Hope this helps :)
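For what it's worth, that regex idea looks roughly like the sketch below, with the usual caveat that regular expressions are fragile on HTML and only cope with simple, non-nested tags:

```python
import re

html = '<span>Name. Of the book.</span><p>skip</p><span>Second.</span>'
# Non-greedy match; breaks on nested spans or attributes like <span class="x">
texts = re.findall(r'<span>(.*?)</span>', html)
print(texts)  # ['Name. Of the book.', 'Second.']
```

For anything beyond trivially flat markup, a real parser like BeautifulSoup is the safer tool.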

BeautifulSoup parser adds unnecessary closing html tags

For example
you have html like
<head>
<meta charset="UTF-8">
<meta name="description" content="Free Web tutorials">
<meta name="keywords" content="HTML,CSS,XML,JavaScript">
<meta name="author" content="John Doe">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</head>
python:
from bs4 import BeautifulSoup as bs
import urllib3
URL = 'html file'
http = urllib3.PoolManager()
page = http.request('GET', URL)
soup = bs(page.data, 'html.parser')
print(soup.prettify())
If you parse it using BeautifulSoup in Python and print it with prettify(), it will give output like this:
output:
<html>
<head>
<meta charset="UTF-8">
<meta name="description" content="Free Web tutorials">
<meta name="keywords" content="HTML,CSS,XML,JavaScript">
<meta name="author" content="John Doe">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
</meta>
</meta>
</meta>
</meta>
</meta>
</head>
but if you have html meta tag like
<meta name="description" content="Free Web tutorials" />
it will give the output as-is and won't add a closing tag.
So how do I stop BeautifulSoup from adding unnecessary closing tags?
To solve this, you just need to change your HTML parser to the lxml parser.
Then your Python script will be:
from bs4 import BeautifulSoup as bs
import urllib3
URL = 'html file'
http = urllib3.PoolManager()
page = http.request('GET', URL)
soup = bs(page.data, 'lxml')
print(soup.prettify())
you just need to change soup = bs(page.data, 'html.parser') to soup = bs(page.data, 'lxml')
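As a quick sanity check (assuming the lxml package is installed), lxml treats <meta> as a void element and does not invent a closing tag:

```python
from bs4 import BeautifulSoup

html = '''<head>
<meta charset="UTF-8">
<meta name="author" content="John Doe">
</head>'''

soup = BeautifulSoup(html, 'lxml')  # requires the lxml package
print(soup.prettify())
```

Note that lxml also normalizes the document, wrapping the fragment in <html> tags, but the meta tags stay flat instead of nesting inside each other.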

Python scraping of dynamic content (visual different from html source code)

I'm a big fan of stackoverflow and typically find solutions to my problems through this website. However, the following problem has bothered me for so long that it forced me to create an account here and ask directly:
I'm trying to scrape this link: https://permid.org/1-21475776041 What I want are the rows "TRCS Asset Class" and "Currency".
For starters, I'm using this code:
from bs4 import BeautifulSoup
import urllib2
url = 'https://permid.org/1-21475776041'
req = urllib2.urlopen(url)
raw = req.read()
soup = BeautifulSoup(raw)
print soup.prettify()
The html code returned (see below) is different from what you can see in your browser upon clicking the link:
<!DOCTYPE html>
<!--[if lt IE 7]> <html ng-app="tmsMdaasApp" class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html ng-app="tmsMdaasApp" class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html ng-app="tmsMdaasApp" class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!-->
<html class="no-js" ng-app="tmsMdaasApp">
<!--<![endif]-->
<head>
<meta content="text/html; charset=utf-8" http-equiv="content-type"/>
<meta charset="utf-8"/>
<meta content="ie=edge" http-equiv="x-ua-compatible"/>
<meta content="max-age=0,no-cache" http-equiv="Cache-Control"/>
<base href="/"/>
<title ng-bind="PageTitle">
Thomson Reuters | PermID
</title>
<meta content="" name="description"/>
<meta content="width=device-width, initial-scale=1" name="viewport"/>
<meta content="#ff8000" name="theme-color"/>
<!-- Place favicon.ico and apple-touch-icon.png in the root directory -->
<link href="app/vendor.daf96efe.css" rel="stylesheet"/>
<link href="app/app.1405210f.css" rel="stylesheet"/>
<link href="favicon.ico" rel="icon"/>
<!-- Typekit -->
<script src="//use.typekit.net/gnw2rmh.js">
</script>
<script>
try{Typekit.load({async:true});}catch(e){}
</script>
<!-- // Typekit -->
<!-- Google Tag Manager Data Layer -->
<!--<script>
analyticsEvent = function() {};
analyticsSocial = function() {};
analyticsForm = function() {};
dataLayer = [];
</script>-->
<!-- // Google Tag Manager Data Layer -->
</head>
<body class="theme-grey" id="top" ng-esc="">
<!--[if lt IE 7]>
<p class="browserupgrade">You are using an <strong>outdated</strong> browser. Please upgrade your browser to improve your experience.</p>
<![endif]-->
<!-- Add your site or application content here -->
<navbar class="tms-navbar">
</navbar>
<div id="body" role="main" ui-view="">
</div>
<div id="footer-wrapper" ng-show="!params.elementsToHide">
<footer id="main-footer">
</footer>
</div>
<!--[if lt IE 9]>
<script src="bower_components/es5-shim/es5-shim.js"></script>
<script src="bower_components/json3/lib/json3.min.js"></script>
<![endif]-->
<script src="app/vendor.8cc12370.js">
</script>
<script src="app/app.6e5f6ce8.js">
</script>
</body>
</html>
Does anyone know what I'm missing here and how I could get it to work?
Thanks, Teemu Risikko - a comment (albeit not the solution) pointing to the website you linked got me on the right path.
In case someone else bumps into the same problem, here is my solution: I'm getting the data via requests to the site's API, not via traditional "scraping" (e.g. BeautifulSoup or lxml).
Navigate to the website using Google Chrome.
Right-click on the website and select "Inspect".
On the top navigation bar select "Network".
Limit network monitor to "XHR".
One of the entries (marked with an arrow) shows the link that can be used with the requests library.
import requests
url = 'https://permid.org/api/mdaas/getEntityById/21475776041'
headers = {'X-AG-Access-Token': YOUR_ACCESS_TOKEN}
r = requests.get(url, headers=headers)
r.json()
Which gets me this:
{u'Asset Class': [u'Units'],
u'Asset Class URL': [u'https://permid.org/1-302043'],
u'Currency': [u'CAD'],
u'Currency URL': [u'https://permid.org/1-500140'],
u'Exchange': [u'TOR'],
u'IsQuoteOf.mdaas': [{u'Is Quote Of': [u'Convertible Debentures Income Units'],
u'URL': [u'https://permid.org/1-21475768667'],
u'quoteOfInstrument': [u'21475768667'],
u'quoteOfInstrument URL': [u'https://permid.org/1-21475768667']}],
u'Mic': [u'XTSE'],
u'PERM ID': [u'21475776041'],
u'Quote Name': [u'CONVERTIBLE DEBENTURES INCOME UNT'],
u'Quote Type': [u'equity'],
u'RIC': [u'OCV_u.TO'],
u'Ticker': [u'OCV.UN'],
u'entityType': [u'Quote']}
With a lot of pages, using the default user-agent will give you a different-looking page, because the site treats the default as an outdated browser. This is what your output is telling you.
Reference on Changing user-agents
Though this may be your problem, it does not exactly answer the question about getting dynamically applied changes on a webpage. To get the dynamically changed data, you need to emulate the JavaScript requests that the page makes on load. If you make the requests that the JavaScript makes, you will get the data that the JavaScript gets.
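Changing the user-agent with requests is just a headers dict. The sketch below only prepares the request without sending it, and the UA string is an illustrative example, not a canonical one:

```python
import requests

headers = {
    # Example browser-like user-agent string (illustrative, not canonical)
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}
req = requests.Request('GET', 'https://permid.org/1-21475776041', headers=headers)
prepared = req.prepare()
print(prepared.headers['User-Agent'])
```

In practice you would just pass the same headers= dict straight to requests.get(); the prepared-request form is only used here to show the header without making a network call.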

Scraping un-closed meta tags with BS4

I am trying to get the content of a meta tag. The problem is that BS4 can't parse the tag properly on some sites where the tag is not closed as it should be. With tags like the example below, the output of my function includes tons of clutter, including other tags such as scripts, links, etc. I believe the browser automatically closes the meta tag somewhere near the end of the head, and this behavior confuses BS4.
My code works with this:
<meta name="description" content="content" />
and doesn't work with:
<meta name="description" content="content">
Here is the code of my BS4 function:
from bs4 import BeautifulSoup
html = BeautifulSoup(open('/path/file.html'), 'html.parser')
desc = html.find(attrs={'name':'description'})
print(desc)
Any way to make it work with those un-closed meta tags?
html5lib or lxml parser would handle the problem properly:
In [1]: from bs4 import BeautifulSoup
...:
...: data = """
...: <html>
...: <head>
...: <meta name="description" content="content">
...: <script>
...: var i = 0;
...: </script>
...: </head>
...: <body>
...: <div id="content">content</div>
...: </body>
...: </html>"""
...:
In [2]: BeautifulSoup(data, 'html.parser').find(attrs={'name': 'description'})
Out[2]: <meta content="content" name="description">\n<script>\n var i = 0;\n </script>\n</meta>
In [3]: BeautifulSoup(data, 'html5lib').find(attrs={'name': 'description'})
Out[3]: <meta content="content" name="description"/>
In [4]: BeautifulSoup(data, 'lxml').find(attrs={'name': 'description'})
Out[4]: <meta content="content" name="description"/>
I've found something new and I hope it can give you some help. I think that every time BeautifulSoup finds an element without a proper end tag, it keeps consuming the following elements until it reaches the end tag of that element's parent. Maybe my idea is still unclear, so here I made a little demo:
hello.html
<!DOCTYPE html>
<html lang="en">
<meta name="description" content="content">
<head>
<meta charset="UTF-8">
<title>Title</title>
</head>
<div>
<p class="title"><b>The Dormouse's story</b>
<p class="story">Once upon a time there were three little sisters; and their names were
Elsie,
Lacie and
Tillie;
and they lived at the bottom of a well.</p>
</p></div>
</body>
</html>
Run it like you did before and you will find the result below:
<meta content="content" name="description">
<head>
<meta charset="utf-8">
<title>Title</title>
</meta></head>
<body>
...
</div></body>
</meta>
OK! BeautifulSoup generates the closing meta tag automatically, and its position is after the </body> tag; it still cannot see the meta's parent closing tag </html>. What I mean is that the end tag should appear in the same position as its start tag, but instead it reflects where the parser finally closed the element. I still couldn't fully convince myself of this opinion, so I made a test: delete the <p class='title'> end tag so that there is only one </p> tag in <div>...</div>. After running
c = soup.find_all('p', attrs={'class':'title'})
print(c[0])
there are two </p> tags in the result. So what I said previously holds.
