how to use beautifulSoup to extract html5 elements such as <section> - python

I intend to extract the article text from an NYT article. However I don't know how to extract by html5 tags such as section name.
import urllib.request
from bs4 import BeautifulSoup
html = urllib.request.urlopen('https://www.nytimes.com/2019/10/24/opinion/chuck-schumer-electric-car.html?action=click&module=Opinion&pgtype=Homepage')
soup = BeautifulSoup(html)
data = soup.findAll(text=True)
The main text is wrapped in a section named 'articleBody'. What kind of soup.find() syntax can I use to extract that?

The find method searches tags by name. It doesn't differentiate HTML5 tag names from any other (X)HTML tag name:
article = soup.find("section",{"name":"articleBody"})
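As a minimal sketch of that call, the snippet below parses a small hand-written HTML string and pulls the paragraphs out of the section whose name attribute is articleBody. The sample markup is an assumption standing in for the NYT page, whose real structure may differ.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the real article page.
sample_html = """
<html><body>
  <header>Site header</header>
  <section name="articleBody">
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </section>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# "name" is also find()'s first positional parameter, so the HTML
# attribute called name has to go through the attrs dictionary.
article = soup.find("section", {"name": "articleBody"})

paragraphs = [p.get_text(strip=True) for p in article.find_all("p")]
print(paragraphs)
```

On a live page, find() returns None when nothing matches, so it is worth checking the result before calling find_all on it.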

You can scrape the pre-loaded data from script tag and parse with json library. The first code block brings back a little more content than you wanted.
You can further restrict the output by looking up the ids of paragraphs within the body and using those to filter content, as shown in the bottom block. You then get exactly the article content you describe.
import requests, re, json
r = requests.get('https://www.nytimes.com/2019/10/24/opinion/chuck-schumer-electric-car.html?action=click&module=Opinion&pgtype=Homepage')
p = re.compile(r'window\.__preloadedData = (.*})')
data = json.loads(p.findall(r.text)[0])
for k,v in data['initialState'].items():
    if k.startswith('$Article') and 'formats' in v:
        print(v['text#stripHtml'] if 'text#stripHtml' in v else v['text'])
You can explore the json here: https://jsoneditoronline.org/?id=f9ae1fb774af439d8e9b32247db9d853
The following shows how to use additional logic to limit to just output you want:
ids = []
for k,v in data['initialState'].items():
    if k.startswith('$Article') and v['__typename'] == 'ParagraphBlock' and 'content' in v:
        ids += [v['content'][0]['id']]
for k,v in data['initialState'].items():
    if k in ids:
        print(v['text'])
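The same two-pass idea can be tried offline. The sketch below runs the regex and the id lookup against a tiny hand-written payload; the key names mirror the ones used above, but the payload itself is invented for illustration and the real page data is much larger and may be shaped differently.

```python
import json
import re

# Hypothetical page source; the real payload is far larger and its keys may differ.
page_source = """
<script>window.__preloadedData = {"initialState": {
  "$Article.1": {"__typename": "ParagraphBlock", "content": [{"id": "Text.1"}]},
  "$Article.2": {"__typename": "ImageBlock"},
  "Text.1": {"text": "Hello from the article."}
}};</script>
"""

# DOTALL lets the greedy match run across newlines up to the last closing brace.
pattern = re.compile(r"window\.__preloadedData = (.*})", re.DOTALL)
state = json.loads(pattern.search(page_source).group(1))["initialState"]

# First pass: collect the ids of text nodes referenced by paragraph blocks.
ids = [
    block["content"][0]["id"]
    for key, block in state.items()
    if key.startswith("$Article")
    and block.get("__typename") == "ParagraphBlock"
    and "content" in block
]

# Second pass: look each id up directly instead of scanning every key again.
texts = [state[i]["text"] for i in ids if i in state]
print(texts)
```

Using block.get("__typename") avoids a KeyError on entries that lack that key.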

Related

Web scrape data in performance gauge

I'm trying to scrape data in a widget using python and the requests-html library.
The value I want is in a gauge with an arrow pointing to one of five possible results.
Each label on the gauge is the same on all pages of the website. The problem I face is I cannot use a css selector on the gauge labels to extract the text, I need to extract the value of the arrow itself as it will be pointing to a label. The arrow doesn't have a text attribute so if I use a css selector I get none as a response.
Each arrow has a unique class name.
<div class="arrow-F-uE7IX8 arrowToStrongBuy-1ydGKDOo arrowStrongBuyShudder-3xsGK8k5">
https://www.tradingview.com/symbols/NASDAQ-MDB/
StrongBuy:
<div class="arrow-F-uE7IX8 arrowToBuy-1R7d8UMJ arrowBuyShudder-3GMCnG5u">
https://www.tradingview.com/symbols/NYSE-XOM/
Buy:
<div class="arrow-F-uE7IX8 arrowToStrongSell-3UWimXJs arrowStrongSellShudder-2UJhm0_C">
https://www.tradingview.com/symbols/NASDAQ-IDEX/
StrongSell:
What can I do to ensure I get the correct value? I'm not sure how I can check if the selector contains the arrowTo{foo} and store as variable.
import pyppdf.patch_pyppeteer
from requests_html import AsyncHTMLSession
asession = AsyncHTMLSession()
async def get_page():
    code = 'NASDAQ-MDB'
    r = await asession.get(f'https://www.tradingview.com/symbols/{code}/')
    await r.html.arender(wait=3)
    return r
results = asession.run(get_page)
for result in results:
    arrow_class_placeholder = "//div[contains(@class,'arrow-F-uE7IX8 arrowToStrongBuy-1ydGKDOo')]//div[1]"
    arrow_class_name = result.html.xpath(arrow_class_placeholder,first=True)
    if arrow_class_name == "//div[contains(@class,'arrow-F-uE7IX8 arrowToStrongBuy-1ydGKDOo')]//div[1]":
        print('StrongBuy')
    else:
        print('not strong buy')
You can use BeautifulSoup4 (bs4), a Python library for pulling data out of HTML and XML files, in combination with regular expressions (RegEx). In this case I used the Python re library for the RegEx part.
Something like this is what you want (source):
In the example above soup.find_all(class_=re.compile("itle")) returns all instances where the word "itle" is found in the class tag, such as class = "title" from the html document shown below.
For your case the pattern would be something like "arrowTo", for example soup.find_all(class_=re.compile("arrowTo")).
Your final code should look something like:
import re
from bs4 import BeautifulSoup
# result is assumed to be the rendered response from requests-html;
# result.html.html holds its markup as a string
soup = BeautifulSoup(result.html.html, 'html.parser')
myArrowToList = soup.find_all(class_=re.compile("arrowTo"))
If you wanted "arrowToStrongBuy" just use that in the regex input to the find_all function.
soup.find_all(class_=re.compile("arrowToStrongBuy"))
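To store the rating as a variable rather than only test for one value, one option is to read the matched element's class list and capture whatever follows arrowTo. This is a sketch using the div from the question as a static string. It assumes the class naming scheme (arrowTo, then the label, then a hyphen and a hash) stays as shown.

```python
import re
from bs4 import BeautifulSoup

# Markup copied from the question; only the class names matter here.
sample_html = (
    '<div class="arrow-F-uE7IX8 arrowToStrongBuy-1ydGKDOo '
    'arrowStrongBuyShudder-3xsGK8k5"></div>'
)

soup = BeautifulSoup(sample_html, "html.parser")
arrow_pattern = re.compile(r"^arrowTo([A-Za-z]+)-")

# A regex passed as class_ is tested against each class name separately.
arrow = soup.find("div", class_=re.compile(r"^arrowTo"))

rating = None
if arrow is not None:
    for class_name in arrow.get("class", []):
        match = arrow_pattern.match(class_name)
        if match:
            rating = match.group(1)  # the label between "arrowTo" and the hyphen
            break

print(rating)
```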

Unable to extract content from DOM element with $0 thru BeautifulSoup

Here is the website I want to scrape the number of reviews from.
I want to extract the number 272, but it returns None every time.
I have to use BeautifulSoup.
I tried:
sources = requests.get('https://www.thebodyshop.com/en-us/body/body-butter/olive-body-butter/p/p000016')
soup = BeautifulSoup(sources.content, 'lxml')
x = soup.find('div', {'class': 'columns five product-info'}).find('div')
print(x)
output - empty tag
I want to go inside that tag further.
The number of reviews is dynamically retrieved from a URL that you can find in the network tab. You can simply extract it from response.text with a regex. The endpoint is part of a defined ajax handler.
You can find a lot of the API instructions in one of the js files: https://thebodyshop-usa.ugc.bazaarvoice.com/static/6097redes-en_us/bvapi.js
For example:
You can trace back through a whole lot of jquery if you really want.
tl;dr; I think you need only add the product_id to a constant string.
import requests, re
# compile the pattern once, outside the loop
p = re.compile(r'"numReviews":(\d+),')
ids = ['p000627']
with requests.Session() as s:
    for product_id in ids:
        r = s.get(f'https://thebodyshop-usa.ugc.bazaarvoice.com/6097redes-en_us/{product_id}/reviews.djs?format=embeddedhtml')
        print(int(p.findall(r.text)[0]))
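The extraction step on its own can be checked without a network call. The sketch below applies a similar pattern to a made-up response fragment; the surrounding keys are hypothetical and only the numReviews key comes from the answer above.

```python
import re

# Hypothetical fragment of the response body; the real payload may differ.
response_text = '{"productId":"p000016","numReviews":272,"avgRating":4.5}'

num_reviews_pattern = re.compile(r'"numReviews":(\d+)')

match = num_reviews_pattern.search(response_text)
# Guard against the key being absent instead of indexing findall()[0].
num_reviews = int(match.group(1)) if match else None
print(num_reviews)
```

Returning None when the key is missing keeps a loop over many product ids from stopping on the first page without reviews.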

Beautiful Soup Nested Tag Search

I am trying to write a python program that will count the words on a web page. I use Beautiful Soup 4 to scrape the page but I have difficulties accessing nested HTML tags (for example: <p class="hello"> inside <div>).
Every time I try finding such tag using page.findAll() (page is Beautiful Soup object containing the whole page) method it simply doesn't find any, although there are. Is there any simple method or another way to do it?
I'm guessing that you want to first look in a specific div tag, then search for all p tags in it and count them or do whatever you want with them. For example:
import bs4
soup = bs4.BeautifulSoup(content, 'html.parser')
# This will get the div
div_container = soup.find('div', class_='some_class')
# Then search in that div_container for all p tags with class "hello"
for ptag in div_container.find_all('p', class_='hello'):
    # prints the p tag content
    print(ptag.text)
Hope that helps
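A CSS selector can express the same "p inside div" search in one call. The sketch below uses select() on an invented HTML string, with the class names taken from the example above and the markup assumed for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page content for illustration.
content = """
<div class="some_class">
  <p class="hello">one two</p>
  <div><p class="hello">three</p></div>
</div>
<p class="hello">outside the div</p>
"""

soup = BeautifulSoup(content, "html.parser")

# The descendant combinator (a space) matches at any nesting depth.
nested = soup.select("div.some_class p.hello")

words = [word for p in nested for word in p.get_text().split()]
print(len(words))
```

The paragraph outside the div is not selected, so only the nested text is counted.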
Try this one:
data = []
for nested_soup in soup.find_all('xyz'):
    data = data + nested_soup.find_all('abc')
Maybe you can turn it into a lambda and make it cool, but this works. Thanks.
UPDATE: I noticed that text does not always return the expected result. Reading the docs, I found that there is a built-in method for getting the text, called get_text(). Use it as:
from bs4 import BeautifulSoup
with open('index.html', 'r') as fd:
    website = fd.read()
soup = BeautifulSoup(website, 'html.parser')
contents = soup.get_text(separator=" ")
# split() with no argument collapses runs of whitespace
print("number of words %d" % len(contents.split()))
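As a self-contained variant of the get_text() approach, the sketch below counts words in an inline HTML string instead of a local file. Dropping script and style elements first is an addition of mine, not part of the answer above. The sample markup is invented.

```python
from bs4 import BeautifulSoup

# Hypothetical document; note the nested <p> and the inline script.
sample_html = """
<html><head><title>Demo</title><script>var x = 1;</script></head>
<body><div><p class="hello">Hello nested world</p></div></body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Drop elements whose text is never shown to the reader.
for tag in soup(["script", "style"]):
    tag.decompose()

# split() with no argument ignores newlines and repeated spaces.
words = soup.get_text(separator=" ").split()
print(len(words))
```

The title text is still counted here; remove the title tag the same way if only body text should count.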
INCORRECT, please read above. Supposing that you have your html file locally in index.html, you can:
from bs4 import BeautifulSoup
import re
BLACKLIST = ["html", "head", "title", "script"]  # tags to be ignored
with open('index.html', 'r') as fd:
    website = fd.read()
soup = BeautifulSoup(website, 'html.parser')
tags = soup.find_all(True)  # find everything
print("there are %d" % len(tags))
count = 0
# non-capturing group, so split() does not return the separators themselves
matcher = re.compile(r"(?:\s|<br>)+")
for tag in tags:
    if tag.name.lower() in BLACKLIST:
        continue
    temp = matcher.split(tag.text)  # split using tokens such as \s and \n
    temp = [word for word in temp if word]  # remove empty elements in the list
    count += len(temp)
print("number of words in the document %d" % count)
Please note that it may not be accurate, because of formatting errors, false positives (it counts any word, even if it is code), text that is shown dynamically using JavaScript or CSS, or other reasons.
You can find all <p> tags using regular expressions (re module).
Note that r.text is a string which contains the whole html of the site.
For example:
r = requests.get(url,headers=headers)
p_tags = re.findall(r'<p>.*?</p>',r.text)
This should get you all the <p> tags, whether or not they are nested. Note that this pattern only matches bare <p> tags without attributes, and only when the tag opens and closes on the same line. If you want the <a> tags specifically inside those tags, pass the matched tag string as the second argument instead of r.text.
Alternatively, if you just want the text, you can try this:
from readability import Document #pip install readability-lxml
import requests
r = requests.get(url,headers=headers)
doc = Document(r.content)
simplified_html = doc.summary()
This will get you a more bare-bones form of the html from the site, which you can then go on to parse.

using lxml and requests in python to grab text between certain tags with a specific class name

I am trying to grab all text between a tag that has a specific class name. I believe I am very close to getting it right, so I think all it'll take is a simple fix.
In the website these are the tags I'm trying to retrieve data from. I want 'SNP'.
<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>
From what I have currently:
from lxml import html
import requests
def main():
    url_link = "http://finance.yahoo.com/q?s=^GSPC&d=t"
    page = html.fromstring(requests.get(url_link).text)
    for span_tag in page.xpath("//span"):
        class_name = span_tag.get("class")
        if class_name is not None:
            if "rtq_exch" == class_name:
                print(url_link, span_tag.text)
if __name__ == "__main__":
    main()
I get this:
http://finance.yahoo.com/q?s=^GSPC&d=t None
To show that it works, when I change this line:
if "rtq_dash" == class_name:
I get this (please note the '-' which is the same content between the tags):
http://finance.yahoo.com/q?s=^GSPC&d=t -
What I think is happening is it sees the child tag and stops grabbing the data, but I'm not sure why.
I would be happy with receiving
<span class="rtq_dash">-</span>SNP
as a string for span_tag.text, as I can easily chop off what I don't want.
At a higher level, I'm trying to get the stock symbol from the page.
Here is the documentation for requests, and here is the documentation for lxml (xpath).
I want to use xpath instead of BeautifulSoup for several reasons, so please don't suggest changing to use that library instead, not that it'd be any easier anyway.
There are a few possible ways. You can find the outer span and return its direct-child text node:
>>> url_link = "http://finance.yahoo.com/q?s=^GSPC&d=t"
>>> page = html.fromstring(requests.get(url_link).text)
>>> for span_text in page.xpath("//span[@class='rtq_exch']/text()"):
...     print(span_text)
...
SNP
or find the inner span and get the tail:
>>> for span_tag in page.xpath("//span[@class='rtq_dash']"):
...     print(span_tag.tail)
...
...
SNP
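The reason span_tag.text came back as None is the ElementTree data model that lxml follows: an element's text is only what precedes its first child, and text after a child's closing tag is stored as that child's tail. The sketch below shows this on the snippet from the question. It uses the standard library's xml.etree.ElementTree as a stand-in for lxml so that it runs without extra packages.

```python
import xml.etree.ElementTree as ET

# The snippet from the question; it happens to be well-formed XML.
snippet = '<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>'

outer = ET.fromstring(snippet)
inner = outer.find("span")

print(repr(outer.text))  # text before the first child
print(repr(inner.text))  # text inside the child
print(repr(inner.tail))  # text after the child's closing tag

# itertext() walks text and tails in document order.
full_text = "".join(outer.itertext())
print(repr(full_text))
```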
Use BeautifulSoup:
import bs4
html = """<span class="rtq_exch"><span class="rtq_dash">-</span>SNP </span>"""
soup = bs4.BeautifulSoup(html)
snp = list(soup.findAll("span", class_="rtq_exch")[0].strings)[1]

How to get the content from a certain <table> using python?

I have some <tr>s, like this:
<tr align=center><td>10876151</td><td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td><td><a href=problem?id=3155>3155</a></td><td><font color=blue>Accepted</font></td><td>344K</td><td>219MS</td><td>C++</td><td>3940B</td><td>2012-10-02 16:42:45</td></tr>
<tr align=center><td>10876150</td><td><a href=userstatus?user_id=BandBandRock>BandBandRock</a></td><td><a href=problem?id=2503>2503</a></td><td><font color=blue>Accepted</font></td><td>16348K</td><td>2750MS</td><td>G++</td><td>840B</td><td>2012-10-02 16:42:25</td></tr>
I want to fetch the content without html tags, like:
yangfanhit
3155
Accepted
344K
219MS
C++
3940B
2012-10-02 16:42:45
Now I'm using the following code to deal with it:
response = urllib2.urlopen('http://poj.org/status', timeout=10)
html = response.read()
response.close()
pattern = re.compile(r'<tr align.*</tr>')
match = pattern.findall(html)
pat = re.compile(r'<td>.*?</td>')
p = re.compile(r'<[/]?.*?>')
for item in match:
    for i in pat.findall(item):
        print p.sub(r'', i)
    print '================================================='
I'm new to regex and also new to python. So could you suggest some better methods to process it?
You could use BeautifulSoup to parse the html. To write the content of the table in csv format:
#!/usr/bin/env python
import csv
import sys
import urllib2
from bs4 import BeautifulSoup # $ pip install beautifulsoup4
soup = BeautifulSoup(urllib2.urlopen('http://poj.org/status'))
writer = csv.writer(sys.stdout)
for tr in soup.find('table', 'a')('tr'):
    writer.writerow([td.get_text() for td in tr('td')])
Output
Run ID,User,Problem,Result,Memory,Time,Language,Code Length,Submit Time
10876151,yangfanhit,3155,Accepted,344K,219MS,C++,3940B,2012-10-02 16:42:45
10876150,BandBandRock,2503,Accepted,16348K,2750MS,G++,840B,2012-10-02 16:42:25
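The row-to-cells step can also be tried offline on one row from the question. This is a sketch in Python 3 using the built-in html.parser. It parses the row as a static string rather than fetching the live page.

```python
from bs4 import BeautifulSoup

# First row from the question, unquoted attributes included.
html = (
    "<tr align=center><td>10876151</td>"
    "<td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td>"
    "<td><a href=problem?id=3155>3155</a></td>"
    "<td><font color=blue>Accepted</font></td>"
    "<td>344K</td><td>219MS</td><td>C++</td><td>3940B</td>"
    "<td>2012-10-02 16:42:45</td></tr>"
)

soup = BeautifulSoup(html, "html.parser")

# get_text() flattens nested <a> and <font> tags to their text.
rows = [[td.get_text() for td in tr.find_all("td")] for tr in soup.find_all("tr")]
print(rows)
```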
Also take a look at PyQuery. It is very easy to pick up if you're familiar with jQuery. Here's an example that returns the table header and data as a list of dictionaries.
import itertools
from pyquery import PyQuery as pq
# parse html
html = pq(url="http://poj.org/status")
# extract header values from table
header = [header.text for header in html(".a").find(".in").find("td")]
# extract data values from table rows in nested list
detail = [[td.text for td in tr] for tr in html(".a").children().not_(".in")]
# merge header and detail to create list of dictionaries
result = [dict(itertools.izip(header, values)) for values in detail]
You really don't need to work with regex directly to parse html, see answer here.
Or see Dive into Python Chapter 8 about HTML Processing.
Why do all that when there are already HTML/XML parsers that do the job easily for you?
Use BeautifulSoup. Considering what you want as mentioned in the above question, it can be done in 2-3 lines of code.
Example:
>>> from bs4 import BeautifulSoup as bs
>>> html = """
<tr align=center><td>10876151</td><td><a href=userstatus?user_id=yangfanhit>yangfanhit</a></td><td><a href=problem?id=3155>3155</a></td><td><font color=blue>Accepted</font></td><td>344K</td><td>219MS</td><td>C++</td><td>3940B</td><td>2012-10-02 16:42:45</td></tr>
<tr align=center><td>10876150</td><td><a href=userstatus?user_id=BandBandRock>BandBandRock</a></td><td><a href=problem?id=2503>2503</a></td><td><font color=blue>Accepted</font></td><td>16348K</td><td>2750MS</td><td>G++</td><td>840B</td><td>2012-10-02 16:42:25</td></tr>
"""
>>> soup = bs(html)
>>> soup.td
<td>10876151</td>
