Python BeautifulSoup issue parsing table

Hi, I am using BeautifulSoup to parse tables on the following website, but not all the rows are getting returned. I am looking for article tags (http://itp.ne.jp/result/?kw=%92J%98e%8E%95%89%C8%83N%83%8A%83j%83b%83N)
import requests
from bs4 import BeautifulSoup

url = 'http://itp.ne.jp/result/?kw=%92J%98e%8E%95%89%C8%83N%83%8A%83j%83b%83N'
page = requests.get(url)
prefsoup = BeautifulSoup(page.content, "html.parser")
art = prefsoup.find_all("article")
print(art)
[<article>
<section class="noimage">
<h4 class="clearfix">
<a class="blackText" href="/shop/KN0114031400001406/" target="_blank">谷脇歯科クリニック</a>
<a class="itrademark24" href="/stats_click/?s_bid=KN0114031400001406&s_sid=FSP-LSR-001&s_fr=V09&s_ck=C12&s_acd=7" target="_blank"><img alt="付加価値情報" src="/img/pc/shop/icon_itrade_7.gif"/></a>
</h4>
<p><span class="inlineSmallHeader">住所</span> 〒060-0042 北海道札幌市中央区大通西5丁目 <a class="boxedLink navigationLink" href="/shop/KN0114031400001406/map.html" target="_blank">地図・ナビ</a></p>
<p><span class="inlineSmallHeader">TEL</span>
<a class="whiteboxicon popup_04" href="/guide/phonemark.html">(代)</a>
<b>011-213-1184</b></p>
<p>
<span class="inlineSmallHeader">URL</span>
http://taniwaki-dental.com</p></section></article>]
However, it is missing the last paragraph, which holds the email information:
<p><span class="inlineSmallHeader">EMAIL</span>
taniwaki#kzh.biglobe.ne.jp<!-- br-->
</p>
Moreover, len(art) returns 2, but art[1] raises an index out of range error.
I tried several pages and got the same issue.

Use the html5lib parser instead of html.parser and it will work like a charm. You just need to change the following line of code:
prefsoup = BeautifulSoup(page.content,"html.parser")
to -
prefsoup = BeautifulSoup(page.content,"html5lib")
Of course, you will first need to install html5lib with pip install html5lib.
Check this as well - https://www.crummy.com/software/BeautifulSoup/bs4/doc/#installing-a-parser
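As a minimal, self-contained sketch of the swap (the fragment here is made up for illustration): html5lib always builds a complete document tree, adding <html>, <head> and <body> even around a bare fragment, which is part of why it copes better with messy real-world markup than html.parser.

```python
from bs4 import BeautifulSoup  # requires: pip install beautifulsoup4 html5lib

fragment = "<article><p>hello</p></article>"  # made-up fragment for illustration
soup = BeautifulSoup(fragment, "html5lib")

# html5lib wraps the fragment in a full <html><head></head><body>...</body></html> tree
print(soup.body.article.p.text)  # hello
```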


Beautiful Soup Query with Filtering in Python

I am trying to save the contents of each article in its own text file. What I am having trouble with is coming up with a Beautiful Soup approach that returns only articles of the type News while ignoring the other article types.
Website in question: https://www.nature.com/nature/articles
Info
Every article is enclosed in a pair of <article> tags
Each article's type sits inside a <span> tag whose data-test attribute has the value article.type.
Title to the article is placed inside the <a> tag with the data-track-label="link" attribute.
The article body is wrapped in a <div> tag (look for "body" in the class attribute).
Current code
I was able to get to the point where I can query the <span> for articles of the News type, but am struggling with the next steps to return the other article-specific information.
How can I take this further? For articles of type News, I'd like to also return the article's title and body while ignoring articles that are not of type News.
# Send HTTP requests
import requests
from bs4 import BeautifulSoup

class WebScraper:
    @staticmethod
    def get_the_source():
        # Obtain the URL
        url = 'https://www.nature.com/nature/articles'
        # Get the webpage
        r = requests.get(url)
        # Check the response object's status code
        if r:
            the_source = open("source.html", "wb")
            soup = BeautifulSoup(r.content, 'html.parser')
            type_news = soup.find_all("span", string='News')
            for i in type_news:
                print(i.text)
            the_source.write(r.content)
            the_source.close()
            print('\nContent saved.')
        else:
            print(f'The URL returned {r.status_code}!')

WebScraper.get_the_source()
Sample HTML for an article that is of type News
The source code has 19 other articles with similar and different article types.
<article class="u-full-height c-card c-card--flush" itemscope itemtype="http://schema.org/ScholarlyArticle">
<div class="c-card__image">
<picture>
<source
type="image/webp"
srcset="
//media.springernature.com/w165h90/magazine-assets/d41586-021-00485-2/d41586-021-00485-2_18927840.jpg?as=webp 160w,
//media.springernature.com/w290h158/magazine-assets/d41586-021-00485-2/d41586-021-00485-2_18927840.jpg?as=webp 290w"
sizes="
(max-width: 640px) 160px,
(max-width: 1200px) 290px,
290px">
<img src="//media.springernature.com/w290h158/magazine-assets/d41586-021-00485-2/d41586-021-00485-2_18927840.jpg"
alt=""
itemprop="image">
</picture>
</div>
<div class="c-card__body u-display-flex u-flex-direction-column">
<h3 class="c-card__title" itemprop="name headline">
<a href="/articles/d41586-021-00485-2"
class="c-card__link u-link-inherit"
itemprop="url"
data-track="click"
data-track-action="view article"
data-track-label="link">Mars arrivals and Etna eruption — February's best science images</a>
</h3>
<div class="c-card__summary u-mb-16 u-hide-sm-max"
itemprop="description">
<p>The month’s sharpest science shots, selected by <i>Nature's</i> photo team.</p>
</div>
<div class="u-mt-auto">
<ul data-test="author-list" class="c-author-list c-author-list--compact u-mb-4">
<li itemprop="creator" itemscope="" itemtype="http://schema.org/Person"><span itemprop="name">Emma Stoye</span></li>
</ul>
<div class="c-card__section c-meta">
<span class="c-meta__item c-meta__item--block-at-xl" data-test="article.type">
<span class="c-meta__type">News</span>
</span>
<time class="c-meta__item c-meta__item--block-at-xl" datetime="2021-03-05" itemprop="datePublished">05 Mar 2021</time>
</div>
</div>
</div>
</article>
</div>
</li>
<li class="app-article-list-row__item">
<div class="u-full-height" data-native-ad-placement="false">
The simplest way, and you get more results per hit, is to add News into the query string as a param
https://www.nature.com/nature/articles?type=news
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.nature.com/nature/articles?type=news')
soup = bs(r.content, 'lxml')
news_articles = soup.select('.app-article-list-row__item')
for n in news_articles:
    print(n.select_one('.c-card__link').text)
A variety of params for page 2 of news:
https://www.nature.com/nature/articles?searchType=journalSearch&sort=PubDate&type=news&page=2
If you monitor the browser network tab whilst manually filtering on the page, or selecting different page numbers, you can see how the querystrings are constructed and tailor your requests accordingly, e.g.
https://www.nature.com/nature/articles?type=news&year=2021
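As a sketch of the querystring approach (parameter names lifted from the URLs above), requests can assemble the querystring for you from a dict via params; preparing the request shows the final URL without hitting the network:

```python
import requests

# Parameter names taken from the nature.com URLs shown above
params = {"searchType": "journalSearch", "sort": "PubDate", "type": "news", "page": 2}
req = requests.Request("GET", "https://www.nature.com/nature/articles", params=params).prepare()

# The prepared request carries the fully encoded URL
print(req.url)
```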
Otherwise, you could do a more convoluted (in/ex)clusion with CSS selectors, based on whether article nodes have a specific child containing "News" (inclusion); the exclusion removes "News" paired with another word/symbol (as per the categories list):
import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.nature.com/nature/articles')
soup = bs(r.content, 'lxml')
news_articles = soup.select('.app-article-list-row__item:has(.c-meta__type:contains("News"):not( \
    :contains("&"), \
    :contains("in"), \
    :contains("Career"), \
    :contains("Feature")))')  # exclusion
for n in news_articles:
    print(n.select_one('.c-card__link').text)
You can remove categories from the :not() list if you do want "News &" or "News in", etc.
If you don't want to filter via the URL, loop through each <article> and check the text of the element with class c-meta__type:
articles = soup.select('article')
for article in articles:
    article_type = article.select_one('.c-meta__type').text.strip()
    if article_type == 'News':
        # or, if the type only needs to contain News:
        # if 'News' in article_type:
        title = article.select_one('a').text
        summary = article.select_one('.c-card__summary p').text
        print("{}: {}\n{}\n\n".format(article_type, title, summary))

How to extract data(text) using beautiful soup when they are in the same class?

I'm working on a personal project where I scrape data from a website. I'm trying to use Beautiful Soup to do this, but I came across data in the same class with different attributes. For example:
<div class="pi--secondary-price">
<span class="pi--price">$11.99 /<abbr title="Kilogram">kg</abbr></span>
<span class="pi--price">$5.44 /<abbr title="Pound">lb.</abbr></span>
</div>
How do I just get $11.99/kg? Right now I'm getting
$11.99 /kg
$5.44 /lb.
I've tried x.select('.pi--secondary-price') but it returns both prices. How do I get only one price ($11.99 /kg)?
You could first get the <abbr> tag and then search for the respective parent tag. Like this:
from bs4 import BeautifulSoup
html = '''
<div class="pi--secondary-price">
<span class="pi--price">$11.99 /<abbr title="Kilogram">kg</abbr></span>
<span class="pi--price">$5.44 /<abbr title="Pound">lb.</abbr></span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
kg = soup.find(title="Kilogram")
print(kg.parent.text)
This gives you the desired output $11.99 /kg. For more information, see the BeautifulSoup documentation.
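An alternative sketch, if you simply want the first price regardless of unit: select_one returns only the first node matching a CSS selector, unlike select, which returns all of them.

```python
from bs4 import BeautifulSoup

html = '''
<div class="pi--secondary-price">
<span class="pi--price">$11.99 /<abbr title="Kilogram">kg</abbr></span>
<span class="pi--price">$5.44 /<abbr title="Pound">lb.</abbr></span>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# select_one stops at the first match; select would return both spans
first_price = soup.select_one('.pi--secondary-price .pi--price')
print(first_price.get_text())  # $11.99 /kg
```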

Python HTML Parsing with BS4

I'm trying to parse HTML using Python and Beautiful Soup, and I want to extract a very specific piece of data. This is the kind of code I'm encountering:
<div class="big_div">
<div class="smaller div">
<div class="other div">
<div class="this">A</div>
<div class="that">2213</div>
<div class="other div">
<div class="this">B</div>
<div class="that">215</div>
<div class="other div">
<div class="this">C</div>
<div class="that">253</div>
There is a series of repeated HTML, as you can see, with only the values differing; my problem is locating a specific value. I want to locate the 253 in the last div. I would appreciate any help, as this is a recurring problem when parsing HTML.
Thank you in advance!
So far I've tried to parse for it, but because the class names are the same I have no idea how to navigate through it. I've tried using a for loop too, but made little to no progress.
You can use the string attribute as an argument in find; see the BS docs for the string attribute.
from bs4 import BeautifulSoup

"""Suppose html is the object holding the html code of the web page that you want to scrape,
and req_text is some text that you want to find"""
soup = BeautifulSoup(html, 'lxml')
req_div = soup.find('div', string=req_text)
req_div will contain the div element which you want.
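For the specific 253 case, here is a sketch that avoids knowing the value in advance: locate the label div by its text, then step to the sibling that holds the value (this assumes the label and value divs stay adjacent, as in the snippet above).

```python
from bs4 import BeautifulSoup

html = '''
<div class="big_div">
<div class="smaller div">
<div class="other div">
<div class="this">A</div>
<div class="that">2213</div>
<div class="other div">
<div class="this">B</div>
<div class="that">215</div>
<div class="other div">
<div class="this">C</div>
<div class="that">253</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# Find the label div by its text, then take the value div next to it;
# find_next_sibling skips the whitespace text nodes in between
label = soup.find('div', class_='this', string='C')
value = label.find_next_sibling('div', class_='that').text
print(value)  # 253
```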

How to get the string from "chrome://downloads" page

I used Chromedriver to download the file, and then I would like to parse "chrome://downloads" to get the download status, but I can't get the string; please refer to the code and result below. I also checked the HTML in Chrome: I could see <span id="name">Noto-hinted (1).zip</span>, but when I use the view page source, I can't find the string "Noto-hinted (1).zip". It is <span id="name" hidden="[[completelyOnDisk_]]">[[data.file_name]]</span>
import time, bs4
from selenium import webdriver
url = "https://noto-website.storage.googleapis.com/pkgs/Noto-hinted.zip"
browser = webdriver.Chrome()
browser.get(url)
time.sleep(5)
browser.get("chrome://downloads/")
time.sleep(30)
soup = bs4.BeautifulSoup(browser.page_source,"lxml")
webElemlist = soup.find('span', id='name')
print(webElemlist)
time.sleep(300)
browser.quit()
Output:
<span id="name"> </span>
I changed 'lxml' to 'html', got the warning messages below, and still can't get the strings.
Warning (from warnings module):
File "C:\Python362\lib\site-packages\bs4\__init__.py", line 181
markup_type=markup_type))
UserWarning: No parser was explicitly specified, so I'm using the best available HTML parser for this system ("lxml"). This usually isn't a problem, but if you run this code on another system, or in a different virtual environment, it may use a different parser and behave differently.
The code that caused this warning is on line 1 of the file . To get rid of this warning, change code that looks like this:
BeautifulSoup(YOUR_MARKUP})
to this:
BeautifulSoup(YOUR_MARKUP, "lxml")
Are you trying to get the downloading item from the screen?
Right-click the element you are trying to click, and select 'Inspect'.
This will open the console and you can see the specific tags for each element of the page, as you hover over them.
I found this for the package:
<div id="title-area">
<a is="action-link" id="file-link" tabindex="0" role="link" hidden="" href="https://noto-website.storage.googleapis.com/pkgs/Noto-hinted.zip">Noto-hinted.zip</a>
<span id="name">Noto-hinted.zip</span>
<span id="tag"></span>
</div>
All you need to do is get the text for these tags using the IDs. This also applies after you have downloaded the file.
Edit:
from bs4 import BeautifulSoup

test = """
<div id="title-area">
<a is="action-link" id="file-link" tabindex="0" role="link" hidden="" href="https://noto-website.storage.googleapis.com/pkgs/Noto-hinted.zip">Noto-hinted.zip</a>
<span id="name">Noto-hinted.zip</span>
<span id="tag"></span>
</div>
"""
soup = BeautifulSoup(test, "lxml")
fileDiv = soup.find("span", {"id": "name"}).text
print(fileDiv)
If the above does not work, try doing this:
soup = bs4.BeautifulSoup(browser.page_source,"html.parser")

I am not able to parse using Beautiful Soup

<td>
<a name="corner"></a>
<div>
<div style="aaaaa">
<div class="class-a">My name is alis</div>
</div>
<div>
<span><span class="class-b " title="My title"><span>Very Good</span></span> </span>
<b>My Description</b><br />
My Name is Alis I am a python learner...
</div>
<div class="class-3" style="style-2 clear: both;">
alis
</div>
</div>
<br /></td>
I want the description after scraping it:
My Name is Alis I am a python learner...
I tried lots of things but I could not figure out the best way. Can you give a general solution for this?
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("Your sample html here")
soup.td.div('div')[2].contents[-1]
This will return the string you are looking for (the unicode string, with any applicable whitespace, it should be noted).
This works by parsing the html, grabbing the first td tag and its contents, grabbing any div tags within the first div tag, selecting the 3rd item in the list (list index 2), and grabbing the last of its contents.
In BeautifulSoup, there are A LOT of ways to do this, so this answer probably hasn't taught you much and I genuinely recommend you read the tutorial that David suggested.
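For completeness, the same extraction sketched with the current bs4 package (the import path changed; the old from BeautifulSoup import BeautifulSoup is BS3): find the <b> label, step to the <br /> after it, then take the raw text node that follows.

```python
from bs4 import BeautifulSoup  # modern bs4; replaces BS3's `from BeautifulSoup import BeautifulSoup`

html = '''<td>
<a name="corner"></a>
<div>
<div style="aaaaa">
<div class="class-a">My name is alis</div>
</div>
<div>
<span><span class="class-b " title="My title"><span>Very Good</span></span> </span>
<b>My Description</b><br />
My Name is Alis I am a python learner...
</div>
<div class="class-3" style="style-2 clear: both;">
alis
</div>
</div>
<br /></td>'''

soup = BeautifulSoup(html, 'html.parser')

# Find the <b> label by its text, move to the following <br />,
# then grab the bare text node that sits after it
desc = soup.find('b', string='My Description').find_next('br').next_sibling
print(desc.strip())  # My Name is Alis I am a python learner...
```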
Have you tried reading the examples provided in the documentation? The quick start is located here: http://www.crummy.com/software/BeautifulSoup/documentation.html#Quick Start
Edit:
To find the div with class class-a, you would load your html via
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup("My html here")
myDiv = soup.find("div", { "class" : "class-a" })
Also remember you can do most of this via the Python console, then use dir() along with help() to walk through what you're trying to do. It might make life easier to try out IPython or Python's IDLE, which have very friendly consoles for beginners.
