Python: extract certain class separately by using bs4 - python

<div class="michelinKeyBenefitsComp">
<section id="benefit-one-content">
<div class="inner">
<div class="col">
<h4 class="h-keybenefits">Banana is yellow.</h4>
<div class="content">
<p>Yellow is my favorite color.</p>
<p> </p>
<p>I love Banana.</p>
</div>
</div>
</div>
</section>
<section id="benefit-two-content">
<div class="inner">
<div class="col">
<h4 class="h-keybenefits">Apple is red.</h4>
<div class="content"><p>Red is not my favorite color.</p>
<p> </p>
<p>I don't like apple.</p>
</div>
</div>
</div>
</section>
</div>
I know how to extract all the text I want from this HTML. Here is my code:
for item in soup.find('div', {'class' : 'michelinKeyBenefitsComp'}):
    try:
        for tex in item.find_all('div', {'class' : 'col'}):
            print(tex.text)
    except:
        pass
But what I would like to do is extract each section's content separately, so I can save them separately. The expected result looks like this:
Banana is yellow.
Yellow is my favorite color.
I love Banana.
#save first
Apple is red.
Red is not my favorite color.
I don't like apple.
#save next
By the way, in this case there are only 2 paragraphs, but in other cases there may be three or more. How can I extract them without knowing how many paragraphs there are? TIA

Maybe you could extract the text this way. The outer div has a unique class, and to pick out each section's text inside it you can use a CSS selector on the nested classes:
from bs4 import BeautifulSoup
text = """
<div class="michelinKeyBenefitsComp">
<section id="benefit-one-content">
<div class="inner">
<div class="col">
<h4 class="h-keybenefits">Banana is yellow.</h4>
<div class="content">
<p>Yellow is my favorite color.</p>
<p> </p>
<p>I love Banana.</p>
</div>
</div>
</div>
</section>
<section id="benefit-two-content">
<div class="inner">
<div class="col">
<h4 class="h-keybenefits">Apple is red.</h4>
<div class="content"><p>Red is not my favorite color.</p>
<p> </p>
<p>I don't like apple.</p>
</div>
</div>
</div>
</section>
</div>
"""
soup = BeautifulSoup(text, 'html.parser')
main_div = soup.find('div', class_='michelinKeyBenefitsComp')
# enumerate starts at 0, so the files are file_0.txt, file_1.txt, ...
for idx, div in enumerate(main_div.select('section > div.inner > div.col')):
    with open('file_'+str(idx)+'.txt', 'w', encoding='utf-8') as f:
        f.write(div.get_text())
# Output goes to a separate file per section; file_0.txt contains
# (plus the blank lines that come from whitespace in the markup):
# Banana is yellow.
# Yellow is my favorite color.
# I love Banana.

This should help.
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, "html.parser")  # html holds the markup from the question
for section in soup.find_all("section", {"id": re.compile("benefit-[a-z]+-content")}):
    # Create the filename from the section ID and append the non-empty lines.
    with open(section["id"] + ".txt", "a") as outfile:
        outfile.write("\n".join(line for line in section.text.strip().split("\n") if line.strip()) + "\n\n")
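If you want each section's heading and paragraphs kept together without knowing how many paragraphs there are, the same idea could be sketched as below. This is only a sketch: the helper name `section_lines` and the `results` dictionary are made up for illustration, and it assumes every section has an `h4` heading followed by `p` tags, as in the question's HTML.

```python
from bs4 import BeautifulSoup

html = """
<div class="michelinKeyBenefitsComp">
<section id="benefit-one-content">
<div class="inner"><div class="col">
<h4 class="h-keybenefits">Banana is yellow.</h4>
<div class="content">
<p>Yellow is my favorite color.</p>
<p> </p>
<p>I love Banana.</p>
</div>
</div></div>
</section>
<section id="benefit-two-content">
<div class="inner"><div class="col">
<h4 class="h-keybenefits">Apple is red.</h4>
<div class="content"><p>Red is not my favorite color.</p>
<p> </p>
<p>I don't like apple.</p>
</div>
</div></div>
</section>
</div>
"""


def section_lines(section):
    """Return the heading and every non-empty paragraph of one <section>."""
    lines = []
    # find_all returns matches in document order, however many there are
    for tag in section.find_all(["h4", "p"]):
        text = tag.get_text(strip=True)
        if text:  # skip the empty <p> </p> spacers
            lines.append(text)
    return lines


soup = BeautifulSoup(html, "html.parser")
# hypothetical container: one list of lines per section, keyed by section id
results = {
    section["id"]: section_lines(section)
    for section in soup.select("div.michelinKeyBenefitsComp > section")
}

for section_id, lines in results.items():
    print(section_id)
    print("\n".join(lines))
```

Each value in `results` can then be written to its own file, for example one named after the section id.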

Related

How to extract the data from encoded HTML class using python

How can I retrieve the page encoded div class of a webpage (title html tag) using Python?
Here is my sample HTML code.
You need to use requests to make a request (it will automatically decode the page, in most cases), and beautifulsoup to extract the data from the HTML.
Update after OP clarifications: the CSS classes are not changing dynamically; they stay the same (as far as I noticed). Since they are stable, you can:
Grab a container (a CSS selector) that wraps all the needed data:
for result in soup.select(".pSzOP-AhqUyc-qWD73c.GNzUNc span"):
    # ...
Use a regex to filter the needed data via re.findall() with a capture group (.*). Only the text matched by the group is returned, and .* matches everything up to the end of the line:
    if re.findall(r"^Telephone\s?:\s?(.*)", result.text):
        # ...
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. On that note, there's a dedicated web scraping with CSS selectors blog post of mine.
Full code:
import requests, re
from bs4 import BeautifulSoup
html = requests.get("https://sites.google.com/a/arden.solihull.sch.uk/futures/home")
soup = BeautifulSoup(html.text, "html.parser")
# all regular expressions for this task
# https://regex101.com/r/cxdxgq/1
for result in soup.select(".pSzOP-AhqUyc-qWD73c.GNzUNc span"):
    if re.findall(r"^Careers\s?.*\s?:\s?(.*)", result.text):
        name = "".join(re.findall(r"^Careers\s?.*\s?:\s?(.*)", result.text.strip()))
        print(name)
    if re.findall(r"^Telephone\s?:\s?(.*)", result.text):
        telephone = "".join(re.findall(r"^Telephone\s?:\s?(.*)", result.text.strip()))
        print(telephone)
    if re.findall(r"^Email\s?:\s?(.*)", result.text):
        email = "".join(re.findall(r"^Email\s?:\s?(.*)", result.text.strip()))
        print(email)
    # to scrape the role you can do the same thing with regex. Test on regex101.com
'''
Mrs A. Fallis
01564 773348
afallis#arden.solihull.sch.uk
Mr S. Brady
01564 7733478
sbrady#arden.solihull.sch.uk
'''
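To see what the capture group returns without making a network request, here is a minimal sketch that runs the same kind of pattern against a few hard-coded strings. The sample lines, the `PATTERNS` table and the `extract` helper are assumptions made for illustration; they are not taken from the page. It uses `re.search` so that each pattern runs only once per line.

```python
import re

# hypothetical sample lines standing in for the text of each <span>
lines = [
    "Telephone : 01564 773348",
    "Email: someone@example.com",
    "Opening hours",
]

PATTERNS = {
    "telephone": r"^Telephone\s?:\s?(.*)",
    "email": r"^Email\s?:\s?(.*)",
}


def extract(text):
    """Return (field, value) for the first pattern that matches, else None."""
    for field, pattern in PATTERNS.items():
        match = re.search(pattern, text.strip())
        if match:
            # group(1) is whatever (.*) captured after the label and colon
            return field, match.group(1)
    return None


found = [extract(line) for line in lines]
print(found)
```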
First solution, written before the OP's clarifications (it shows only the extraction part, since no website URL had been provided):
from bs4 import BeautifulSoup
html = """
<div class="L581yb VICjCf" hjdwnd-ahquyc-r6poud="" jndksc="" l6ctce-pszop"="" l6ctce-purzt="" tabindex=" == $0
<div class=">
</div>
<div class="hJDwNd-AhqUyc-WNfPc purZT-AhqUyC-I15mzb PSzOP-AhqUyc-qWD73c JNdks <div class=" jndksc-smkayb"="">
<div class="" f570id"="" jsaction="zXBUYD: ZTPCnb; 2QF9Uc: Qxe3nd;
jsname=" jscontroller="SGWD4d">
>
<div class="oKdM2C KzvoMe">
<div class="hJDwNd-AhqUyc-WNFPC PSzOP-AhqUyc- qWD73c jXK9ad D2fZ2 Oj CsFc whaque GNzUNC" id="h.7f5e93de0cf8a767_49">
<div class="]XK9ad-SmkAyb">
<div class="ty]Ctd mGzaTb baZpAe">
<div class="GV3q8e aP9Z7e" id="h.p_9livxd801krd">
</div>
<h3 class="CDt4ke zfr3Q OmQG5e" dir="ltr" id="h.p_9livxd801krd" tabindex="-1">
.
</h3>
<div class="GV3q8e aP9z7e" id="h.p JrEgQYpyORCF">
</div>
<h3 class="CDt 4Ke zfr3Q OmQG5e" dir="ltr" id="h.p_JrEgQYPYORCF" tabindex="-1">
<div class="CjVfdc" jsaction="touchstart:UrsOsc; click:Kjs
qPd; focusout:QZoaz; mouseover:yOpDld; mouseout:dq0hvd;fvlRjc:jbFSO
d;CrflRd:SzACGe;" jscontroller="Ae65rd">
<div class="PPHIP rviiZ" jsname="haAclf">
.
</div>
<span style="font-family: 'Oswald'; font-weight: 500;">
Telephone : 01564 773348
</span>
</div>
</h3>
<div class="GV3q8e aP9z7e" id="h.p_sylefz-BOSBX">
</div>
><h3 id="h.p_sylefz-BOSBX" dir="ltr" class="CDt 4Ke zfr3Q OmQG5e"
</div>
</div>
</div>
</div>
</div>
</div>
"""
# pass HTML to BeautifulSoup object and assign a html.parser as a HTML parser
soup = BeautifulSoup(html, "html.parser")
# grab a phone number (only first occurrence will be extracted)
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
print(soup.select_one('.CjVfdc span').text.strip())
# Telephone : 01564 773348
# extract <div> element with .L581yb class. returns a list()
print(soup.select('.L581yb'))
'''
[<div class="L581yb VICjCf" hjdwnd-ahquyc-r6poud="" jndksc="" l6ctce-pszop"="" l6ctce-purzt="" tabindex=" == $0
<div class=">
</div>]
'''
# extract <div> element with .hJDwNd-AhqUyc-WNfPc class. returns a list()
print(soup.select('.hJDwNd-AhqUyc-WNfPc'))
'''
[<div class="hJDwNd-AhqUyc-WNfPc purZT-AhqUyC-I15mzb PSzOP-AhqUyc-qWD73c JNdks <div class=" jndksc-smkayb"="">
<div class="" f570id"="" jsaction="zXBUYD: ZTPCnb; 2QF9Uc: Qxe3nd;
jsname=" jscontroller="SGWD4d">
>
<div class="oKdM2C KzvoMe">
<div class="hJDwNd-AhqUyc-WNFPC PSzOP-AhqUyc- qWD73c jXK9ad D2fZ2 Oj CsFc whaque GNzUNC" id="h.7f5e93de0cf8a767_49">
<div class="]XK9ad-SmkAyb">
<div class="ty]Ctd mGzaTb baZpAe">
<div class="GV3q8e aP9Z7e" id="h.p_9livxd801krd">
</div>
<h3 class="CDt4ke zfr3Q OmQG5e" dir="ltr" id="h.p_9livxd801krd" tabindex="-1">
.
</h3>
<div class="GV3q8e aP9z7e" id="h.p JrEgQYpyORCF">
</div>
<h3 class="CDt 4Ke zfr3Q OmQG5e" dir="ltr" id="h.p_JrEgQYPYORCF" tabindex="-1">
<div class="CjVfdc" jsaction="touchstart:UrsOsc; click:Kjs
qPd; focusout:QZoaz; mouseover:yOpDld; mouseout:dq0hvd;fvlRjc:jbFSO
d;CrflRd:SzACGe;" jscontroller="Ae65rd">
<div class="PPHIP rviiZ" jsname="haAclf">
.
</div>
<span style="font-family: 'Oswald'; font-weight: 500;">
Telephone : 01564 773348
</span>
</div>
</h3>
<div class="GV3q8e aP9z7e" id="h.p_sylefz-BOSBX">
</div>
><h3 id="h.p_sylefz-BOSBX" dir="ltr" class="CDt 4Ke zfr3Q OmQG5e"
</div>
</div>
</div>
</div>
</div>
</div>]
'''

How to get the text of the next tag? (Beautiful Soup)

The html code is :
<div class="card border p-3">
<span class="small text-muted">Contact<br></span>
<div>Steven Cantrell</div>
<div class="small">Department of Justice</div>
<div class="small">Federal Bureau of Investigation</div>
<!---->
<!---->
<!---->
<div class="small">skcantrell#fbi.gov</div>
<div class="small">256-313-8835</div>
</div>
I want to get the text inside the <div> tag, i.e. Steven Cantrell.
I need a way to get the contents of the tag that follows a known one. In this case, the known tag is 'span',{'class':'small text-muted'}.
What I tried is :
rfq_name = soup.find('span',{'class':'small text-muted'})
print(rfq_name.next)
But this printed Contact instead of the name.
You're nearly there; just change your print to: print(rfq_name.find_next('div').text)
Alternatively, find the element that has the text "Contact", then use .find_next() to get the next <div> tag.
from bs4 import BeautifulSoup
html = '''<div class="card border p-3">
<span class="small text-muted">Contact<br></span>
<div>Steven Cantrell</div>
<div class="small">Department of Justice</div>
<div class="small">Federal Bureau of Investigation</div>
<!---->
<!---->
<!---->
<div class="small">skcantrell#fbi.gov</div>
<div class="small">256-313-8835</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
# string= is the current name of the older text= argument (Beautiful Soup 4.4.0+)
contact = soup.find(string='Contact').find_next('div').text
print(contact)
Output:
Steven Cantrell
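The reason the original attempt printed Contact is that the next element in parse order after the opening <span> is the text inside it, not the following tag. A small sketch of the difference, using a shortened copy of the question's HTML (the variable names are only illustrative):

```python
from bs4 import BeautifulSoup

html = '''<div class="card border p-3">
<span class="small text-muted">Contact<br></span>
<div>Steven Cantrell</div>
<div class="small">Department of Justice</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
label = soup.find('span', {'class': 'small text-muted'})

# next_element is the next node in parse order: the text inside the span
inside_span = label.next_element
# find_next('div') skips forward to the first <div> tag after the span
name = label.find_next('div').get_text(strip=True)
# find_next_sibling('div') only looks at siblings on the same level
sibling_name = label.find_next_sibling('div').get_text(strip=True)

print(repr(str(inside_span)))
print(name)
print(sibling_name)
```

Here both lookups land on the same <div>; find_next_sibling is the stricter choice when the wanted tag must sit at the same nesting level as the label.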

Using BeautifulSoup to extract specific nested div

I have this HTML code which I'm creating the script for:
http://imgur.com/a/dPNYI
I would like to extract the highlighted text ("some text") and print it.
I tried going through every nested div in the way to the div I needed, like this:
import requests
from bs4 import BeautifulSoup
url = "the url this is from"
r = requests.get(url)
soup = BeautifulSoup(r.text, "html.parser")
for div in soup.find_all("div", {"id": "main"}):
    for div2 in div.find_all("div", {"id": "app"}):
        for div3 in div2.find_all("div", {"id": "right-sidebar"}):
            for div4 in div3.find_all("div", {"id": "chat"}):
                for div5 in div4.find_all("div", {"id": "chat-messages"}):
                    for div6 in div5.find_all("div", {"class": "chat-message"}):
                        for div7 in div6.find_all("div", {"class": "chat-message-content selectable"}):
                            print(div7.text.strip())
I implemented what I've seen in guides and similar questions online, but I bet this is not even close and there must be a much easier way. This doesn't work: it doesn't print anything, and I'm a bit lost. How can I print the highlighted line (which is essentially the very first div child of the div with the id "chat-messages")?
HTML CODE:
<!DOCTYPE html>
<html>
<head>
<title>
</title>
</head>
<body>
<div id="main">
<div data-reactroot="" id="app">
<div class="top-bar-authenticated" id="top-bar">
</div>
<div class="closed" id="navigation-bar">
</div>
<div id="right-sidebar">
<div id="chat">
<div id="chat-head">
</div>
<div id="chat-title">
</div>
<div id="chat-messages">
<div class="chat-message">
<div class="chat-message-avatar" style="background-image: url("https://steamcdn-a.akamaihd.net/steamcommunity/public/images/avatars/65/657dcec97cc00bc378629930ecae1776c0d981e0.jpg");">
</div>
<a class="chat-message-username clickable">
<div class="iron-color">
aloe
</div></a>
<div class="chat-message-content selectable">
<!-- react-text: 2532 -->some text<!-- /react-text -->
</div>
</div>
<div class="chat-message">
</div>
<div class="chat-message">
</div>
<div class="chat-message">
</div>
<div class="chat-message">
</div>
<div class="chat-message">
</div>
With the lxml parser (i.e. soup = BeautifulSoup(data, 'lxml')), you can use .find with multiple classes just as easily as with a single class to find nested divs:
soup.find('div',{'class':'chat-message-content selectable'}).text
The line above should work for you as long as the element you want is the first (or only) one with that class in the HTML, since .find returns the first match.
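One caveat: passing a space-separated string as the class value matches the exact string of the class attribute, so it depends on the order of the classes. A CSS selector with chained classes does not. Below is a minimal sketch on a trimmed copy of the question's HTML; it uses html.parser only so that it runs without lxml installed. If the chat messages are rendered in the browser by JavaScript, the HTML returned by requests may not contain them at all, which would also explain an empty result.

```python
from bs4 import BeautifulSoup

html = """
<div id="chat-messages">
<div class="chat-message">
<a class="chat-message-username clickable"><div class="iron-color">aloe</div></a>
<div class="chat-message-content selectable">
<!-- react-text: 2532 -->some text<!-- /react-text -->
</div>
</div>
<div class="chat-message">
</div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one returns the first match or None; chained classes may appear in any order
first = soup.select_one(
    "#chat-messages > div.chat-message div.chat-message-content.selectable"
)
message = first.get_text(strip=True) if first is not None else None
print(message)
```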

Python Scrapy - dynamic HTML, div and span content needed

So I'm new to Scrapy and am looking to do something which is proving a little too ambitious. I'm hoping somebody out there can help guide me on how to gather and parse the info I'm after from this website.
I need to obtain the following:
label1
4810 (this is generated dynamically)
Business name
Name
Address1
Address2
Address3
Address4
Postcode
0800 111111
me#domain.com
Is this even possible using scrapy?
Many thanks in advance.
<div class="mbg">
<a href="http://www.domain.com" aria-label="label1"> <span class="nw1">Label13345</span>
</a>
<span class="mbg-l">
4810
<img
alt="4810"
title="4810"
src="http://www.domain.com/image1"></span>
</div>
<div id="bsi-c" class=" bsi-c-uk-bislr">
<div class="bsi-cnt">
<div class="bsi-ttl section-ttl">
<h2>Info</h2>
<div class="rd-sep"></div>
</div>
<div class="bsi-bn">Business name</div>
<div class="bsi-cic">
<div id="bsi-ec" class="u-flL">
<span class="bsi-arw"></span>
<span class="bsi-cdt">Contact details</span>
</div>
<div id="e8" class="u-flL bsi-ci">
<div class="bsi-c1">
<div>Name</div>
<div>Address1</div>
<div>Address2</div>
<div>Address3</div>
<div>Address4</div>
<div>Postcode</div>
</div>
<div class="bsi-c2">
<br></br>
<div>
<span class="bsi-lbl">Phone:</span>
<span>0800 111111</span>
</div>
<div>
<span class="bsi-lbl">Email:</span>
<span>me#domain.com</span>
</div>
</div>
</div>
</div>
An example of parsing the already received page might look something like this:
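import lxml.html
page = """<div><span> . . .</span></div> """  # placeholder for the HTML from the question
doc = lxml.html.document_fromstring(page)
# get label1 (the aria-label attribute of the link) and 4810 (the text of span.mbg-l)
label = doc.cssselect('.mbg a')[0].get('aria-label')
number = doc.cssselect('.mbg .mbg-l')[0].text_content().strip()
# get address
address = doc.cssselect('.u-flL .bsi-c1')[0].text_content()
# get phone and mail: the <span> that directly follows each .bsi-lbl label
values = doc.cssselect('.bsi-c2 .bsi-lbl + span')
phone = values[0].text_content()
mail = values[1].text_content()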
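If the page must be fetched from the network, you can do it like this:
import requests, lxml.html
page = requests.get('http://site_.com')  # placeholder URL; requests needs the scheme
doc = lxml.html.document_fromstring(page.text)
# the value is the <span> that directly follows the "Phone:" label
phone = doc.cssselect('.bsi-c2 .bsi-lbl + span')[0].text_content()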

Limiting findall() in beautifulsoup to just a section of the html

Here is my situation: I am scraping this HTML fine with the code below, but I can't find how to separate the first section from the second. I just want to scrape the first section, and the second section apart from it. I'm using beautifulsoup4.
Don't mind myData(link); it is the urlopen and HTML read function.
The html
<div id="first_content" class="header">
<div class="list">
<div class="row">
<a name="03049302"></a>
<div class="col-xs-12 drop-panel-content">
<p>
first section first text. </p>
</div>
<div class="drop-panel drop-panel-one-row-height">
<p class="text-center">Edit</p>
<p class="text-center">Share</p>
</div>
</div>
<div class="row">
<a name="03049303"></a>
<div class="col-xs-12 drop-panel-content">
<p>
first section second text. </p>
</div>
<div class="drop-panel drop-panel-one-row-height">
<p class="text-center">Edit</p>
<p class="text-center">Share</p>
<section id="second_content">
<a name="aname" class="btn-collapse collapsed" data-toggle="collapse" data-target="#aname">
<h3>A Name</h3>
</a>
<div class="collapse flush-width flush-down" id="aname">
<div class="list">
<div class="row">
<a name="03049304"></a>
<div class="col-xs-12 drop-panel-content">
<p>
second section first text. </p>
</div>
<div class="drop-panel drop-panel-one-row-height">
<p class="text-center">Edit</p>
<p class="text-center">Share</p>
</div>
This is the code:
try:
    all_data = myData(link).findAll("div", {"class": "col-xs-12 drop-panel-content"})
    for data in all_data:
        print(data.text)
except AttributeError as e:
    return None
(By "apart" I mean not in the same output.)
Current output
first section first text.
first section second text.
second section first text.
Wanted output
first section first text.
first section second text.
and, separately (maybe from another function):
second section first text.
One option would be to differentiate the sections using the section tag. The second section is inside a section tag, but the first one is not.
all_data = soup.find_all("div", {"class": "col-xs-12 drop-panel-content"})
for data in all_data:
    if data.find_parent("section") is None:
        print(data.get_text(strip=True))
Or, if there are always exactly 2 texts in the first section, simply slice the list:
all_data = soup.find_all("div", {"class": "col-xs-12 drop-panel-content"})[:2]
for data in all_data:
    print(data.get_text(strip=True))
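To get both groups in one pass rather than only filtering out the second, the find_parent check can be turned into a small partition. This is a sketch under assumptions: the helper name `split_sections` is made up, and the HTML is a shortened, well-formed version of the question's markup in which the section sits outside the first block.

```python
from bs4 import BeautifulSoup

html = """
<div id="first_content" class="header">
  <div class="list">
    <div class="row">
      <div class="col-xs-12 drop-panel-content"><p>first section first text. </p></div>
    </div>
    <div class="row">
      <div class="col-xs-12 drop-panel-content"><p>first section second text. </p></div>
    </div>
  </div>
</div>
<section id="second_content">
  <div class="list">
    <div class="row">
      <div class="col-xs-12 drop-panel-content"><p>second section first text. </p></div>
    </div>
  </div>
</section>
"""


def split_sections(soup):
    """Return (first, second): texts outside and inside any <section>."""
    first, second = [], []
    for div in soup.select("div.col-xs-12.drop-panel-content"):
        text = div.get_text(strip=True)
        # find_parent returns None when no enclosing <section> exists
        if div.find_parent("section") is None:
            first.append(text)
        else:
            second.append(text)
    return first, second


first_texts, second_texts = split_sections(BeautifulSoup(html, "html.parser"))
print(first_texts)
print(second_texts)
```

Unlike slicing with [:2], this does not depend on how many texts the first section happens to contain.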
