I am using Python and BeautifulSoup to parse HTML data and pull p tags out of RSS feeds. However, some URLs cause problems because the parsed soup object does not include all nodes of the document.
For example, I tried to parse http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm
But after comparing the parsed object with the page's source code, I noticed that all nodes after the ul class="nextgen-left" element are missing.
Here is how I parse the Documents:
import cookielib
import urllib2
from bs4 import BeautifulSoup as bs

url = 'http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm'
cj = cookielib.CookieJar()
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
request = urllib2.Request(url)
response = opener.open(request)
soup = bs(response, 'lxml')
print soup
The input HTML is not quite conformant, so you'll have to use a different parser here. The html5lib parser handles this page correctly:
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get('http://feeds.chicagotribune.com/~r/ChicagoBreakingNews/~3/T2Zg3dk4L88/story01.htm')
>>> soup = BeautifulSoup(r.text, 'lxml')
>>> soup.find('div', id='story-body') is not None
False
>>> soup = BeautifulSoup(r.text, 'html5lib')
>>> soup.find('div', id='story-body') is not None
True
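To see why the parser choice matters, here is a minimal sketch with a made-up malformed snippet (a stray end tag, similar in spirit to the broken markup on the Tribune page); the built-in html.parser recovers without dropping the content that follows the error:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: a stray </form> end tag with no matching
# open tag, followed by more content.
broken = '<div id="a">before</form><div id="b">after</div></div>'

soup = BeautifulSoup(broken, 'html.parser')

# The content after the bogus end tag is still in the tree:
print(soup.find('div', id='b').get_text())  # → after
```

A stricter parser such as lxml may stop recovering cleanly at the error and drop later nodes, which is exactly the symptom above; html5lib applies the full HTML5 error-recovery rules and is the most tolerant of the three.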
Related
I am trying to scrape this page: https://ntrs.nasa.gov/search
I am using the code below, and BeautifulSoup is finding only 3 a tags when there are many more. I have tried the html5lib, lxml, and html.parser parsers, but none of them have worked.
Can you advise what the problem might be, please?
import requests
import urllib.request
import time
from bs4 import BeautifulSoup
# Set the URL
url = 'https://ntrs.nasa.gov/search'
# Connect to the URL
response = requests.get(url)
# Parse HTML and save to a BeautifulSoup object
soup = BeautifulSoup(response.content, "html5lib")
# soup = BeautifulSoup(response.text, "html5lib")
# soup = BeautifulSoup(response.content, "html.parser")
# soup = BeautifulSoup(response.content, "lxml")
# loop through all a-tags
for a_tag in soup.findAll('a'):
    if 'title' in a_tag.attrs:
        if a_tag['title'] == 'Download Document':
            link = a_tag['href']
            download_url = 'https://ntrs.nasa.gov' + link
            urllib.request.urlretrieve(download_url, './' + link[link.find('/citations/') + 1:11])
It is dynamically pulled from a script tag. You can regex out the JavaScript object which contains the download URL, handle some string replacements for the HTML entities, parse it as JSON, then extract the desired URL:
import requests, re, json
r = requests.get('https://ntrs.nasa.gov/search')
data = json.loads(re.search(r'(\{.*/api.*\})', r.text).group(1).replace('&q;','"'))
print('https://ntrs.nasa.gov' + data['http://ntrs-proxy-auto-deploy:3001/citations/search']['results'][0]['downloads'][0]['links']['pdf'])
You could append the ?attachment=true but I don't think that is required.
Your problem stems from the fact that the page is rendered using JavaScript; the actual page source contains only a few script and style tags.
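The extraction step itself can be tried offline. A minimal sketch with a made-up script body that mimics the &q;-escaped state blob on the NTRS page (the key names and PDF path here are invented for illustration):

```python
import json
import re

# Hypothetical page body: the app state is embedded in a <script>
# tag as JSON with quote characters encoded as "&q;".
html = ('<script>window.state = {&q;results&q;:[{&q;downloads&q;:'
        '[{&q;links&q;:{&q;pdf&q;:&q;/api/citations/1/a.pdf&q;}}]}]}</script>')

# Regex out the {...} object, restore the quotes, then parse as JSON.
raw = re.search(r'(\{.*\})', html).group(1)
data = json.loads(raw.replace('&q;', '"'))

pdf = data['results'][0]['downloads'][0]['links']['pdf']
print('https://ntrs.nasa.gov' + pdf)  # → https://ntrs.nasa.gov/api/citations/1/a.pdf
```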
I'm having some trouble figuring out how to parse HTML that's contained within the response of an API call in Python 3.7 (requests + BS4).
Say I want to parse out the article URLs from a response like this one.
I'm able to get the "rendering" entry of the response which seemingly contains the HTML I'd like to parse, however, when I pass the text along to Beautiful Soup's HTML parser, it does not seem to work as expected (unable to locate HTML tags of any kind):
import requests
from bs4 import BeautifulSoup
url = """https://www.washingtonpost.com/pb/api/v2/render/feature/?service=prism-query&contentConfig={%22url%22:%22prism://prism.query/ap-articles-by-site-id,/world%22,%22offset%22:0,%22limit%22:5}&customFields={%22isLoadMore%22:false,%22offset%22:0,%22maxToShow%22:50,%22dedup%22:true}&id=f00boImX29Vv3s&rid=&uri=/world/"""
r = requests.get(url).json()
soup = BeautifulSoup(r['rendering'], 'html.parser')
links_html = soup.find_all("div", attrs={"class":"headline x-small normal-style text-align-inherit "})
links = []
for div in links_html:
    links.append(div.find('a', href=True)['href'])
Am I wrong in my assumption that the "rendering" entry in the response is raw HTML?
You want to use the json library (or, in hindsight, Response.json()), because the link you're visiting isn't actually a website but an API on top of it that gives you the HTML along with the encoding, content type, and some other things that won't be necessary.
Here's how I did it.
>>> import requests
>>> from bs4 import BeautifulSoup
>>> r = requests.get("https://www.washingtonpost.com/pb/api/v2/render/feature/?service=prism-query&contentConfig=%7B%22url%22:%22prism://prism.query/ap-articles-by-site-id,/world%22,%22offset%22:0,%22limit%22:5%7D&customFields=%7B%22isLoadMore%22:false,%22offset%22:0,%22maxToShow%22:50,%22dedup%22:true%7D&id=f00boImX29Vv3s&rid=&uri=/world/")
>>> bs = BeautifulSoup(r.content, 'html.parser')
>>> first_div = bs.find("div", class_="moat-trackable")
>>> first_div
>>> import json
>>> html_dict = json.loads(r.content)
>>> html_dict
{'rendering': '<div class="moat-trackable ...'}
>>> html_dict.keys()
dict_keys(['rendering', 'encoding', 'contentType', 'pageResources', 'externalResources', 'httpHeaders'])
>>> bs = BeautifulSoup(html_dict["rendering"], 'html.parser')
>>> first_div = bs.find("div", class_="moat-trackable")
>>> first_div
<div class="moat-trackable
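The same flow works with the network call replaced by an inlined payload (a trimmed, made-up stand-in for the real response), which shows how JSON decoding and BeautifulSoup fit together; with requests this first step would simply be payload = r.json():

```python
import json
from bs4 import BeautifulSoup

# Stand-in for the decoded API response body.
payload = json.loads(
    '{"rendering": "<div class=\\"moat-trackable\\">'
    '<a href=\\"/world/story\\">Story</a></div>", "encoding": "utf-8"}'
)

# The "rendering" entry is raw HTML, so hand it to BeautifulSoup.
soup = BeautifulSoup(payload['rendering'], 'html.parser')
print(soup.find('div', class_='moat-trackable').a['href'])  # → /world/story
```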
I am new to BeautifulSoup and I am practicing with little tasks. Here I am trying to get the "previous" link on this site. The HTML is
here
My code is
import requests, bs4
from bs4 import BeautifulSoup
url = 'https://www.xkcd.com/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
result = soup.find('div', id="comic")
url2 = result.find('ul', class_='comicNav').find('a', rel='prev').find('href')
But it fails with a NoneType error. I have read some posts about child elements in HTML and tried some different things, but it still does not work. Thank you in advance for your help.
You could use a CSS selector instead.
import requests, bs4
from bs4 import BeautifulSoup
url = 'https://www.xkcd.com/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
result = soup.select('.comicNav a[rel~="prev"]')[0]
print(result)
If you want just the href, change it to
result = soup.select('.comicNav a[rel~="prev"]')[0]["href"]
To get the prev link, find the ul tag and then find the a tag inside it. Try the code below.
import requests, bs4
from bs4 import BeautifulSoup
url = 'https://www.xkcd.com/'
page = requests.get(url)
soup = BeautifulSoup(page.content, 'html.parser')
url2 = soup.find('ul', class_='comicNav').find('a',rel='prev')['href']
print(url2)
Output:
/2254/
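Note that the extracted href is site-relative; if you need an absolute URL for a follow-up request, urljoin from the standard library resolves it against the base:

```python
from urllib.parse import urljoin

url = 'https://www.xkcd.com/'
prev_href = '/2254/'  # the value extracted above

print(urljoin(url, prev_href))  # → https://www.xkcd.com/2254/
```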
Working on a partial answer to this question, I came across a bs4.element.Tag that is a mess of nested dicts and lists (s, below).
Is there a way to return a list of urls contained in s without using re.find_all? Other comments regarding the structure of this tag are helpful too.
from bs4 import BeautifulSoup
import requests
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')
s = soup.find('script', type='application/ld+json')
## the first bit of s:
# s
# Out[116]:
# <script type="application/ld+json">
# {"@context":"http://schema.org","@type":"ItemList","numberOfItems":50,
What I've tried:
randomly perusing through methods with tab completion on s.
picking through the docs.
My problem is that s only has 1 attribute (type) and doesn't seem to have any child tags.
You can use s.text to get the content of the script. It's JSON, so you can then just parse it with json.loads. From there, it's simple dictionary access:
import json
from bs4 import BeautifulSoup
import requests
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')
s = soup.find('script', type='application/ld+json')
urls = [el['url'] for el in json.loads(s.text)['itemListElement']]
print(urls)
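The same approach can be tested without a network call by inlining a trimmed ld+json block (the two job URLs here are made up, but the structure follows the schema.org ItemList shape shown above):

```python
import json
from bs4 import BeautifulSoup

# Hypothetical, trimmed version of the jobs page's ld+json block.
html = '''<script type="application/ld+json">
{"@context": "http://schema.org", "@type": "ItemList", "numberOfItems": 2,
 "itemListElement": [
   {"@type": "ListItem", "position": 1, "url": "https://stackoverflow.com/jobs/1"},
   {"@type": "ListItem", "position": 2, "url": "https://stackoverflow.com/jobs/2"}]}
</script>'''

soup = BeautifulSoup(html, 'html.parser')
s = soup.find('script', type='application/ld+json')

# The script's content is a plain string, so parse it as JSON.
urls = [el['url'] for el in json.loads(s.string)['itemListElement']]
print(urls)
```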
More simply:
from bs4 import BeautifulSoup
import requests
link = 'https://stackoverflow.com/jobs?med=site-ui&ref=jobs-tab&sort=p'
r = requests.get(link)
soup = BeautifulSoup(r.text, 'html.parser')
s = soup.find('script', type='application/ld+json')
# JUST THIS (bound to a new name so it doesn't shadow the json module)
data = json.loads(s.string)
When I inspect the elements in my browser, I can obviously see the exact web content. But when I run the script below, some of the web page details are missing. On the web page I see there are "#document" elements, and those are missing when I run the script. How can I see the details of the #document elements, or extract them with the script?
from bs4 import BeautifulSoup
import requests
response = requests.get('http://123.123.123.123/')
soup = BeautifulSoup(response.content, 'html.parser')
print soup.prettify()
You need to make additional requests to get the frame page contents as well:
from urlparse import urljoin
from bs4 import BeautifulSoup
import requests
BASE_URL = 'http://123.123.123.123/'
with requests.Session() as session:
    response = session.get(BASE_URL)
    soup = BeautifulSoup(response.content, 'html.parser')

    for frame in soup.select("frameset frame"):
        frame_url = urljoin(BASE_URL, frame["src"])
        response = session.get(frame_url)
        frame_soup = BeautifulSoup(response.content, 'html.parser')
        print(frame_soup.prettify())