Limiting findall() in BeautifulSoup to just a section of the HTML - Python

Here is my situation: I am scraping this HTML fine with the code below, but I can't find how to separate the first section from the second. I just want to scrape the first section and, apart from it, the second section, using beautifulsoup4.
Don't mind myData(link); it is the urlopen-and-html-read helper function.
The HTML:
<div id="first_content" class="header">
<div class="list">
<div class="row">
<a name="03049302"></a>
<div class="col-xs-12 drop-panel-content">
<p>
first section first text. </p>
</div>
<div class="drop-panel drop-panel-one-row-height">
<p class="text-center">Edit</p>
<p class="text-center">Share</p>
</div>
</div>
<div class="row">
<a name="03049303"></a>
<div class="col-xs-12 drop-panel-content">
<p>
first section second text. </p>
</div>
<div class="drop-panel drop-panel-one-row-height">
<p class="text-center">Edit</p>
<p class="text-center">Share</p>
<section id="second_content">
<a name="aname" class="btn-collapse collapsed" data-toggle="collapse" data-target="#aname">
<h3>A Name</h3>
</a>
<div class="collapse flush-width flush-down" id="aname">
<div class="list">
<div class="row">
<a name="03049304"></a>
<div class="col-xs-12 drop-panel-content">
<p>
second section first text. </p>
</div>
<div class="drop-panel drop-panel-one-row-height">
<p class="text-center">Edit</p>
<p class="text-center">Share</p>
</div>
This is the code:
try:
    all_data = myData(link).find_all("div", {"class": "col-xs-12 drop-panel-content"})
    for data in all_data:
        print(data.text)
except AttributeError:
    return None
** Apart as in: not in the same output.
Current output
first section first text.
first section second text.
second section first text.
Wanted output
first section first text.
first section second text.
and wanted output, apart in another function maybe
second section first text.

One option would be to differentiate the sections using that section tag. The second section is inside the section tag, but the first one is not.
all_data = soup.find_all("div", {"class": "col-xs-12 drop-panel-content"})
for data in all_data:
    if data.find_parent("section") is None:
        print(data.get_text(strip=True))
Or, if the first section always has exactly two texts, simply slice the list:
all_data = soup.find_all("div", {"class": "col-xs-12 drop-panel-content"})[:2]
for data in all_data:
    print(data.get_text(strip=True))
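Putting the first approach together: a runnable sketch against a simplified stand-in for the question's markup (trimmed to the relevant divs, so treat it as illustrative only):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the question's markup: the second section's
# rows live inside a <section> tag, the first section's rows do not.
html = """
<div id="first_content" class="header">
  <div class="row"><div class="col-xs-12 drop-panel-content"><p>first section first text.</p></div></div>
  <div class="row"><div class="col-xs-12 drop-panel-content"><p>first section second text.</p></div></div>
  <section id="second_content">
    <div class="row"><div class="col-xs-12 drop-panel-content"><p>second section first text.</p></div></div>
  </section>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
all_data = soup.find_all("div", {"class": "col-xs-12 drop-panel-content"})

# Split on whether the div has a <section> ancestor
first = [d.get_text(strip=True) for d in all_data if d.find_parent("section") is None]
second = [d.get_text(strip=True) for d in all_data if d.find_parent("section") is not None]

print(first)   # ['first section first text.', 'first section second text.']
print(second)  # ['second section first text.']
```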

Related

how to get specific links with BeautifulSoup?

I am trying to crawl HTML source with Python using BeautifulSoup.
I need to get the href of specific <a> link tags.
This is my test code. I want to get links like <a href="/example/test/link/activity1~10" target="_blank">:
<div class="listArea">
<div class="activity_sticky" id="activity">
.
.
</div>
<div class="activity_content activity_loaded">
<div class="activity-list-item activity_item__1fhpg">
<div class="activity-list-item_activity__3FmEX">
<div>...</div>
<a href="/example/test/link/activity1" target="_blank">
<div class="activity-list-item_addr">
<span> 0x1292311</span>
</div>
</a>
</div>
</div>
<div class="activity-list-item activity_item__1fhpg">
<div class="activity-list-item_activity__3FmEX">
<div>...</div>
<a href="/example/test/link/activity2" target="_blank">
<div class="activity-list-item_addr">
<span> 0x1292312</span>
</div>
</a>
</div>
</div>
.
.
.
</div>
</div>
Check the main page of the bs4 documentation:
for link in soup.find_all('a'):
    print(link.get('href'))
This is code for the problem. You should find all the <a></a> tags, then get the value of each href.
soup = BeautifulSoup(html, 'html.parser')
for i in soup.find_all('a'):
    if i.get('target') == "_blank":  # .get() avoids a KeyError on anchors without a target attribute
        print(i['href'])
Hope my answer could help you.
To select the specific <a>, as an alternative to @Mason Ma's answer, you can also use CSS selectors:
soup.select('.activity_content a')
or select by its target attribute:
soup.select('.activity_content a[target="_blank"]')
Example
Will give you a list of links, matching your condition:
import requests
from bs4 import BeautifulSoup
html = '''
<div class="activity_content activity_loaded">
<div class="activity-list-item activity_item__1fhpg">
<div class="activity-list-item_activity__3FmEX">
<div>...</div>
<a href="/example/test/link/activity1" target="_blank">
<div class="activity-list-item_addr">
<span> 0x1292311</span>
</div>
</a>
</div>
</div>
<div class="activity-list-item activity_item__1fhpg">
<div class="activity-list-item_activity__3FmEX">
<div>...</div>
<a href="/example/test/link/activity2" target="_blank">
<div class="activity-list-item_addr">
<span> 0x1292312</span>
</div>
</a>
</div>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')
[x['href'] for x in soup.select('.activity_content a[target="_blank"]')]
Output
['/example/test/link/activity1', '/example/test/link/activity2']
Based on my understanding of your question, you're trying to extract the links (href) from anchor tags where the target value is _blank. You can do this by searching for all anchor tags, then narrowing down to those whose target == '_blank':
links = soup.find_all('a', attrs={'target': '_blank'})
for link in links:
    print(link.get('href'))

How to extract the data from encoded HTML class using python

How can I retrieve the encoded div class of a webpage (the title HTML tag) using Python?
Here is my sample HTML code.
You need to use requests to make the request (it will automatically decode the page, in most cases) and beautifulsoup to extract the data from the HTML.
Update after OP clarifications. CSS classes are not dynamically updating, they're the same (that's what I noticed). Since they're the same, you can:
grab a container with all needed data (a container (CSS selector) that wraps needed data)
for result in soup.select(".pSzOP-AhqUyc-qWD73c.GNzUNc span"):
# ...
use regex to filter (find) all needed data via re.findall() and a capture group (.*): only the match inside the parentheses is captured and returned, and .* means match everything.
if re.findall(r"^Telephone\s?:\s?(.*)", result.text):
# ...
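As a standalone illustration of that capture-group behaviour (a tiny sketch, no page fetch involved):

```python
import re

text = "Telephone : 01564 773348"
# \s? allows an optional space around the colon; (.*) captures
# everything after it, and only that group is returned by findall()
match = re.findall(r"^Telephone\s?:\s?(.*)", text)
print(match)  # ['01564 773348']
```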
Have a look at the SelectorGadget Chrome extension to grab CSS selectors by clicking on the desired element in your browser. On that note, I have a dedicated blog post about web scraping with CSS selectors.
Code and example in the online IDE:
import requests, re
from bs4 import BeautifulSoup
html = requests.get("https://sites.google.com/a/arden.solihull.sch.uk/futures/home")
soup = BeautifulSoup(html.text, "html.parser")
# all regular expressions for this task
# https://regex101.com/r/cxdxgq/1
for result in soup.select(".pSzOP-AhqUyc-qWD73c.GNzUNc span"):
    if re.findall(r"^Careers\s?.*\s?:\s?(.*)", result.text):
        name = "".join(re.findall(r"^Careers\s?.*\s?:\s?(.*)", result.text.strip()))
        print(name)
    if re.findall(r"^Telephone\s?:\s?(.*)", result.text):
        telephone = "".join(re.findall(r"^Telephone\s?:\s?(.*)", result.text.strip()))
        print(telephone)
    if re.findall(r"^Email\s?:\s?(.*)", result.text):
        email = "".join(re.findall(r"^Email\s?:\s?(.*)", result.text.strip()))
        print(email)
# to scrape the role you can do the same thing with regex. Test on regex101.com
'''
Mrs A. Fallis
01564 773348
afallis#arden.solihull.sch.uk
Mr S. Brady
01564 7733478
sbrady#arden.solihull.sch.uk
'''
The first solution, without the OP's clarifications (it shows only the extraction part, since no website URL was provided at that point):
from bs4 import BeautifulSoup
html = """
<div class="L581yb VICjCf" hjdwnd-ahquyc-r6poud="" jndksc="" l6ctce-pszop"="" l6ctce-purzt="" tabindex=" == $0
<div class=">
</div>
<div class="hJDwNd-AhqUyc-WNfPc purZT-AhqUyC-I15mzb PSzOP-AhqUyc-qWD73c JNdks <div class=" jndksc-smkayb"="">
<div class="" f570id"="" jsaction="zXBUYD: ZTPCnb; 2QF9Uc: Qxe3nd;
jsname=" jscontroller="SGWD4d">
>
<div class="oKdM2C KzvoMe">
<div class="hJDwNd-AhqUyc-WNFPC PSzOP-AhqUyc- qWD73c jXK9ad D2fZ2 Oj CsFc whaque GNzUNC" id="h.7f5e93de0cf8a767_49">
<div class="]XK9ad-SmkAyb">
<div class="ty]Ctd mGzaTb baZpAe">
<div class="GV3q8e aP9Z7e" id="h.p_9livxd801krd">
</div>
<h3 class="CDt4ke zfr3Q OmQG5e" dir="ltr" id="h.p_9livxd801krd" tabindex="-1">
.
</h3>
<div class="GV3q8e aP9z7e" id="h.p JrEgQYpyORCF">
</div>
<h3 class="CDt 4Ke zfr3Q OmQG5e" dir="ltr" id="h.p_JrEgQYPYORCF" tabindex="-1">
<div class="CjVfdc" jsaction="touchstart:UrsOsc; click:Kjs
qPd; focusout:QZoaz; mouseover:yOpDld; mouseout:dq0hvd;fvlRjc:jbFSO
d;CrflRd:SzACGe;" jscontroller="Ae65rd">
<div class="PPHIP rviiZ" jsname="haAclf">
.
</div>
<span style="font-family: 'Oswald'; font-weight: 500;">
Telephone : 01564 773348
</span>
</div>
</h3>
<div class="GV3q8e aP9z7e" id="h.p_sylefz-BOSBX">
</div>
><h3 id="h.p_sylefz-BOSBX" dir="ltr" class="CDt 4Ke zfr3Q OmQG5e"
</div>
</div>
</div>
</div>
</div>
</div>
"""
# pass HTML to BeautifulSoup object and assign a html.parser as a HTML parser
soup = BeautifulSoup(html, "html.parser")
# grab a phone number (only first occurrence will be extracted)
# https://www.crummy.com/software/BeautifulSoup/bs4/doc/#css-selectors
print(soup.select_one('.CjVfdc span').text.strip())
# Telephone : 01564 773348
# extract <div> element with .L581yb class. returns a list()
print(soup.select('.L581yb'))
'''
[<div class="L581yb VICjCf" hjdwnd-ahquyc-r6poud="" jndksc="" l6ctce-pszop"="" l6ctce-purzt="" tabindex=" == $0
<div class=">
</div>]
'''
# extract <div> element with .hJDwNd-AhqUyc-WNfPc class. returns a list()
print(soup.select('.hJDwNd-AhqUyc-WNfPc'))
'''
[<div class="hJDwNd-AhqUyc-WNfPc purZT-AhqUyC-I15mzb PSzOP-AhqUyc-qWD73c JNdks <div class=" jndksc-smkayb"="">
<div class="" f570id"="" jsaction="zXBUYD: ZTPCnb; 2QF9Uc: Qxe3nd;
jsname=" jscontroller="SGWD4d">
>
<div class="oKdM2C KzvoMe">
<div class="hJDwNd-AhqUyc-WNFPC PSzOP-AhqUyc- qWD73c jXK9ad D2fZ2 Oj CsFc whaque GNzUNC" id="h.7f5e93de0cf8a767_49">
<div class="]XK9ad-SmkAyb">
<div class="ty]Ctd mGzaTb baZpAe">
<div class="GV3q8e aP9Z7e" id="h.p_9livxd801krd">
</div>
<h3 class="CDt4ke zfr3Q OmQG5e" dir="ltr" id="h.p_9livxd801krd" tabindex="-1">
.
</h3>
<div class="GV3q8e aP9z7e" id="h.p JrEgQYpyORCF">
</div>
<h3 class="CDt 4Ke zfr3Q OmQG5e" dir="ltr" id="h.p_JrEgQYPYORCF" tabindex="-1">
<div class="CjVfdc" jsaction="touchstart:UrsOsc; click:Kjs
qPd; focusout:QZoaz; mouseover:yOpDld; mouseout:dq0hvd;fvlRjc:jbFSO
d;CrflRd:SzACGe;" jscontroller="Ae65rd">
<div class="PPHIP rviiZ" jsname="haAclf">
.
</div>
<span style="font-family: 'Oswald'; font-weight: 500;">
Telephone : 01564 773348
</span>
</div>
</h3>
<div class="GV3q8e aP9z7e" id="h.p_sylefz-BOSBX">
</div>
><h3 id="h.p_sylefz-BOSBX" dir="ltr" class="CDt 4Ke zfr3Q OmQG5e"
</div>
</div>
</div>
</div>
</div>
</div>]
'''

How to get the text of the next tag? (Beautiful Soup)

The html code is :
<div class="card border p-3">
<span class="small text-muted">Contact<br></span>
<div>Steven Cantrell</div>
<div class="small">Department of Justice</div>
<div class="small">Federal Bureau of Investigation</div>
<!---->
<!---->
<!---->
<div class="small">skcantrell#fbi.gov</div>
<div class="small">256-313-8835</div>
</div>
I want to get the output inside the <div> tag, i.e. Steven Cantrell.
I need a way to get the contents of the tag that follows 'span', {'class':'small text-muted'}.
What I tried is :
rfq_name = soup.find('span',{'class':'small text-muted'})
print(rfq_name.next)
But this printed Contact instead of the name.
You're nearly there, just change your print to: print(rfq_name.find_next('div').text)
Find the element that has the text "Contact". Then use .find_next() to get the next <div> tag.
from bs4 import BeautifulSoup
html = '''<div class="card border p-3">
<span class="small text-muted">Contact<br></span>
<div>Steven Cantrell</div>
<div class="small">Department of Justice</div>
<div class="small">Federal Bureau of Investigation</div>
<!---->
<!---->
<!---->
<div class="small">skcantrell#fbi.gov</div>
<div class="small">256-313-8835</div>
</div>'''
soup = BeautifulSoup(html, 'html.parser')
contact = soup.find(text='Contact').find_next('div').text
Output:
print(contact)
Steven Cantrell
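A side note, not part of the original answer: since BeautifulSoup 4.4 the text= argument is also available under the name string=, which newer documentation prefers. The same lookup with that spelling:

```python
from bs4 import BeautifulSoup

html = '''<div class="card border p-3">
<span class="small text-muted">Contact<br></span>
<div>Steven Cantrell</div>
</div>'''

soup = BeautifulSoup(html, 'html.parser')
# string= is the modern alias for text= (BeautifulSoup 4.4+);
# the match is the "Contact" text node, find_next() walks to the next <div>
contact = soup.find(string='Contact').find_next('div').text
print(contact)  # Steven Cantrell
```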

Python: extract certain class separately by using bs4

<div class="michelinKeyBenefitsComp">
<section id="benefit-one-content">
<div class="inner">
<div class="col">
<h4 class="h-keybenefits">Banana is yellow.</h4>
<div class="content">
<p>Yellow is my favorite color.</p>
<p> </p>
<p>I love Banana.</p>
</div>
</div>
</div>
</section>
<section id="benefit-two-content">
<div class="inner">
<div class="col">
<h4 class="h-keybenefits">Apple is red.</h4>
<div class="content"><p>Red is not my favorite color.</p>
<p> </p>
<p>I don't like apple.</p>
</div>
</div>
</div>
</section>
</div>
I know how to extract all the text I want from this HTML. Here is my code:
for item in soup.find('div', {'class' : 'michelinKeyBenefitsComp'}):
    try:
        for tex in item.find_all('div', {'class' : 'col'}):
            print(tex.text)
    except:
        pass
But what i would like to do is extract the content separately, so I can save them separately. The result is expected like this:
Banana is yellow.
Yellow is my favorite color.
I love Banana.
#save first
Apple is red.
Red is not my favorite color.
I don't like apple.
#save next
By the way, in this case, there are only 2 paragraph, but in other cases, there are probably three or more paragraphs. How can I extract them without knowing how many paragraphs they have? TIA
Maybe you should try this way of extracting the text: you have a div with a unique id, but for selecting the section text inside it you can use classes to properly select the text from a particular div.
from bs4 import BeautifulSoup
text = """
<div class="michelinKeyBenefitsComp">
<section id="benefit-one-content">
<div class="inner">
<div class="col">
<h4 class="h-keybenefits">Banana is yellow.</h4>
<div class="content">
<p>Yellow is my favorite color.</p>
<p> </p>
<p>I love Banana.</p>
</div>
</div>
</div>
</section>
<section id="benefit-two-content">
<div class="inner">
<div class="col">
<h4 class="h-keybenefits">Apple is red.</h4>
<div class="content"><p>Red is not my favorite color.</p>
<p> </p>
<p>I don't like apple.</p>
</div>
</div>
</div>
</section>
</div>
"""
soup = BeautifulSoup(text, 'html.parser')
main_div = soup.find('div', class_='michelinKeyBenefitsComp')
for idx, div in enumerate(main_div.select('section > div.inner > div.col')):
    with open('file_' + str(idx) + '.txt', 'w', encoding='utf-8') as f:
        f.write(div.get_text())
# Output in a separate file: file_0.txt > Banana is yellow.
#                                         Yellow is my favorite color.
#                                         I love Banana.
This should help.
from bs4 import BeautifulSoup
import re
soup = BeautifulSoup(html, "html.parser")
for sec in soup.find_all("section", {"id": re.compile("benefit-[a-z]+-content")}):
    # Create a filename based on the section id and write the non-empty lines.
    with open(sec["id"] + ".txt", "a") as outfile:
        outfile.write("\n".join(line for line in sec.text.strip().split("\n") if line.strip()) + "\n\n")
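To address the "three or more paragraphs" concern directly: find_all('p') does not need to know the paragraph count in advance. A small sketch over a trimmed copy of the question's markup:

```python
from bs4 import BeautifulSoup

# Trimmed copy of the question's markup (one section shown)
html = """
<div class="michelinKeyBenefitsComp">
  <section id="benefit-one-content">
    <div class="inner"><div class="col">
      <h4 class="h-keybenefits">Banana is yellow.</h4>
      <div class="content">
        <p>Yellow is my favorite color.</p>
        <p> </p>
        <p>I love Banana.</p>
      </div>
    </div></div>
  </section>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
records = []
for section in soup.select("div.michelinKeyBenefitsComp > section"):
    heading = section.h4.get_text(strip=True)
    # find_all('p') returns every paragraph, however many there are;
    # the truthiness test drops the empty <p> </p> spacers
    paragraphs = [p.get_text(strip=True) for p in section.find_all("p")
                  if p.get_text(strip=True)]
    records.append([heading] + paragraphs)

print(records)
```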

How to scrape a whole website using beautifulsoup

I'm quite new to programming, and OO programming especially. Nonetheless, I'm trying to write a very simple spider for web crawling. Here's my first approach:
I need to fetch the data out of this page: http://europa.eu/youth/volunteering/evs-organisation_en
First, I view the page source to find the HTML elements:
view-source:https://europa.eu/youth/volunteering/evs-organisation_en
Note: I need to fetch the data that comes right below this line:
EVS accredited organisations search results: 6066
I chose Beautiful Soup for this job, since it is very powerful.
I use find_all:
soup.find_all('p')[0].get_text() # Searching for tags by class and id
Note: Classes and IDs are used by CSS to determine which HTML elements to apply certain styles to. We can also use them when scraping to specify specific elements we want to scrape.
See the class:
<div class="col-md-4">
<div class="vp ey_block block-is-flex">
<div class="ey_inner_block">
<h4 class="text-center">"People need people" Zaporizhya oblast civic organisation of disabled families</h4>
<p class="ey_info">
<i class="fa fa-location-arrow fa-lg"></i>
Zaporizhzhya, <strong>Ukraine</strong>
</p> <p class="ey_info"><i class="fa fa-hand-o-right fa-lg"></i> Sending</p>
<p><strong>PIC no:</strong> 935175449</p>
<div class="empty-block">
Read more </div>
</div>
so this leads to:
# import libraries
import requests
from bs4 import BeautifulSoup

page = requests.get("https://europa.eu/youth/volunteering/evs-organisation_en")
soup = BeautifulSoup(page.content, 'html.parser')
soup
Now, we can use the find_all method to search for items by class or by id. In the example below, we'll search for any element that has the class col-md-4:
<div class="col-md-4">
so we choose:
soup.find_all(class_="col-md-4")
Now I have to combine all.
Update - my approach so far:
I have extracted data wrapped within multiple HTML tags from a webpage using BeautifulSoup4. I want to store all of the extracted data in a list. And, to be more concrete: I want each piece of extracted data as a separate list element, separated by commas (i.e. CSV-formatted).
To begin with the beginning:
here we have the HTML content structure:
<div class="view-content">
<div class="row is-flex"></span>
<div class="col-md-4"></span>
<div class </span>
<div class= >
<h4 Data 1 </span>
<div class= Data 2</span>
<p class=
<i class=
<strong>Data 3 </span>
</p> <p class= Data 4 </span>
<p class= Data 5 </span>
<p><strong>Data 6</span>
<div class=</span>
<a href="Data 7</span>
</div>
</div>
Code to extract:
for data in elem.find_all('span', class_=""):
This should give an output:
data = [ele.text for ele in soup.find_all('span', {'class':'NormalTextrun'})]
print(data)
Output:
[' Data 1 ', ' Data 2 ', ' Data 3 ' and so forth]
Question: I need help with the extraction part...
Try this:
data = [ele.text for ele in soup.find_all(text = True) if ele.text.strip() != '']
print(data)
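Since the goal is CSV-formatted records, here is one hedged sketch of the extraction part, built on the card markup quoted earlier in the question (trimmed to a single illustrative card; csv.writer handles the comma separation):

```python
import csv
from bs4 import BeautifulSoup

# One card, trimmed from the markup quoted in the question
html = """
<div class="col-md-4">
  <div class="ey_inner_block">
    <h4 class="text-center">"People need people" Zaporizhya oblast civic organisation of disabled families</h4>
    <p class="ey_info">Zaporizhzhya, <strong>Ukraine</strong></p>
    <p class="ey_info">Sending</p>
    <p><strong>PIC no:</strong> 935175449</p>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for card in soup.find_all(class_="col-md-4"):
    name = card.h4.get_text(strip=True)
    # one list element per <p>, in document order
    info = [p.get_text(" ", strip=True) for p in card.find_all("p")]
    rows.append([name] + info)

# each card becomes one comma-separated line
with open("organisations.csv", "w", newline="") as f:
    csv.writer(f).writerows(rows)

print(rows[0])
```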
