Removing particular content from parsed results using BeautifulSoup - Python

def get_description(link):
    redditFile = urllib2.urlopen(link)
    redditHtml = redditFile.read()
    redditFile.close()
    soup = BeautifulSoup(redditHtml)
    desc = soup.find('div', attrs={'class': 'op_gd14 FL'}).text
    return desc
This is the code that gives me the text from this HTML:
<div class="op_gd14 FL">
<p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br>
Read all announcements in Prestige Estate </p><p> </p>
</div>
This result is fine for me; I just want to exclude the text
Read all announcements in Prestige Estate
from the result (that is, desc in my script) if it is present, and ignore it if it is not present. How can I do this?

You can use extract() to remove unnecessary tags from the find() result:
descItem = soup.find('div', attrs={'class': 'op_gd14 FL'}) # get the DIV
[s.extract() for s in descItem('a')] # remove <a> tags
return descItem.get_text() # return the text
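Putting that together, here is a minimal Python 3 sketch of the whole function (it assumes, as the extract() approach implies, that the unwanted sentence is wrapped in an <a> tag on the live page; urllib.request stands in for Python 2's urllib2):

from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_description(link):
    # Fetch and parse the page
    redditHtml = urlopen(link).read()
    soup = BeautifulSoup(redditHtml, 'html.parser')
    descItem = soup.find('div', attrs={'class': 'op_gd14 FL'})
    if descItem is None:
        return None
    # "Read all announcements ..." is a link, so removing every <a>
    # tag drops it; if no link is present, the loop simply does nothing
    for a in descItem('a'):
        a.extract()
    return descItem.get_text()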

Alternatively, make some changes to the last line and add the re module (note that this operates on the div's HTML string, not its extracted text):
...
return re.sub(r'<a(.*)</a>','',desc)
Output:
'<div class="op_gd14 FL">\n <p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br> \n </p><p>

Related

How to scrape last string of <p> tag element?

To start, Python is the first language I am learning.
I am scraping a website for rent prices across my city and I am using BeautifulSoup to get the price data, but I am unable to get the value of this tag.
Here is the tag:
<p><strong class="hidden show-mobile-inline">Monthly Rent: </strong>2,450 +</p>
Here is my code:
text = soup.find_all("div", {"class", "plan-group rent"})
for item in text:
    rent = item.find_all("p")
    for price in rent:
        print(price.string)
I also tried:
text = soup.find_all("div", {"class", "plan-group rent"})
for item in text:
    rent = item.find_all("p")
    for price in rent:
        items = price.find_all("strong")
        for item in items:
            print(item.string)
and that works to print out "Monthly Rent:", but I don't understand why I can't get the actual price. The above code shows me that the label is in the strong tag, which means the rest of the p tag contains only the price, which is what I want.
As mentioned by @kyrony, there are two children in your <p>. Because you select the <strong>, you will only get one of the texts.
You could use different approaches, e.g. stripped_strings:
list(soup.p.stripped_strings)[-1]
or contents
soup.p.contents[-1]
or with the recursive argument
soup.p.find(text=True, recursive=False)
Example
from bs4 import BeautifulSoup
html = '''<p><strong class="hidden show-mobile-inline">Monthly Rent: </strong>2,450 +</p>'''
soup = BeautifulSoup(html, 'html.parser')
soup.p.contents[-1]
Technically your content has two children
<p><strong class="hidden show-mobile-inline">Monthly Rent: </strong>2,450 +</p>
A strong tag
<strong class="hidden show-mobile-inline">Monthly Rent: </strong>
and a string
2,450 +
The string property in Beautiful Soup returns None when a tag has more than one child, so here it is going to return None. In order to get the second string you need to use the stripped_strings generator (or one of the approaches above).
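For completeness, a runnable sketch of the stripped_strings approach on the tag from the question:

from bs4 import BeautifulSoup

html = '<p><strong class="hidden show-mobile-inline">Monthly Rent: </strong>2,450 +</p>'
soup = BeautifulSoup(html, 'html.parser')

# stripped_strings yields every text node with whitespace trimmed:
# ['Monthly Rent:', '2,450 +'], so the price is the last item
price = list(soup.p.stripped_strings)[-1]
print(price)  # 2,450 +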

How to use loop 'find next sibling' until reaching a certain tag when web scraping with beautifulsoup in python?

The webpage I'm attempting to scrape has a section where the html tags are nested like so:
<div>
<h3>
<p>
<p>
<h3>
<p>
<p>
<p>
My code is able to navigate to the correct tag, but I am struggling to split the text by <h3>, as each <h3> is a sibling, not a child. I am either able to print just the tags or print all the text within the tag without splitting it into sections.
I've tried using for loops, but I don't think that is the right approach when searching within siblings. I think looping an if statement to determine whether find_next_sibling().name == 'h3' might work, but I've been unable to iterate this without nesting a large number of if statements.
Can anyone please advise on what approach I should take? Please see my full code below - the treaty files section works fine.
from bs4 import BeautifulSoup
import requests

url = 'https://www.gov.uk/government/publications/albania-tax-treaties'
get_url = requests.get(url)
url_html = get_url.content
soup = BeautifulSoup(url_html, 'lxml')

treaty_files = soup.find_all('div', class_='attachment-details')
for treaty_file in treaty_files:
    file_name = treaty_file.h3.a.text
    file_url = treaty_file.h3.a['href']
    #print(f"Treaty Name: {file_name}")
    #print(f"Treaty URL: {file_url}")
    #print()

#Attempt 1
treaty_details = soup.find('div', class_='govspeak').find_all('h3')
for treaty_content in treaty_details:
    content = treaty_content.find_next_siblings()
    for x in content:
        test = x
    a = test
    #print(a)

#Attempt 2
treaty_details = soup.find('div', class_='govspeak').find_all('h3')
for treaty_content in treaty_details:
    content = treaty_content.find_next_sibling()
    while content.name != 'h3':
        print(f"Text: {content.text}")
        content = content.find_next_sibling()
        if content.name == 'h3':
            break
One possible solution is to leverage the pandas.Series.groupby function to group sections together:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.gov.uk/government/publications/albania-tax-treaties"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

govspeak = soup.select_one(".govspeak")
s = pd.Series(govspeak.find_all(recursive=False))

for _, g in s.groupby(s.apply(lambda x: x.name).eq("h3").cumsum()):
    title = g.iloc[0].text
    text = "\n".join(row.text for row in g.iloc[1:])
    print(title)
    print("-" * 120)
    print(text)
    print()
    print()
Prints:
2021 UK-Albania Synthesised text of the Multilateral Instrument and the 2013 Double Taxation Agreement — in force
------------------------------------------------------------------------------------------------------------------------
The 2013 UK-Albania Double Taxation Agreement has been modified by the Multilateral Instrument (MLI).
The modifications made by the Multilateral Instrument entered into force in:
the UK on 1 October 2018
Albania on 1 January 2021
They are effective in the UK from:
1 January 2021 for taxes withheld at source
1 April 2022 for Corporation Tax
6 April 2022 for Income Tax and Capital Gains Tax
They are effective in Albania from 1 July 2021.
2013 UK-Albania Double Taxation Agreement — in force
------------------------------------------------------------------------------------------------------------------------
The agreement entered into force on 30 December 2013.
It is effective in the UK from:
1 April 2014 for Corporation Tax
6 April 2014 for Income Tax and Capital Gains Tax
It is effective in Albania from 1 January 2014 for Income Tax and Capital Gains Tax.
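If you would rather not pull in pandas just for the grouping, the same sectioning can be sketched with a plain loop over the direct children (same URL; checking tag.name against "h3" mirrors the cumsum trick above):

import requests
from bs4 import BeautifulSoup

url = "https://www.gov.uk/government/publications/albania-tax-treaties"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

sections = {}
title = None
# Walk the direct children of the govspeak div in document order;
# each <h3> starts a new section, everything else joins the current one
for tag in soup.select_one(".govspeak").find_all(recursive=False):
    if tag.name == "h3":
        title = tag.get_text(strip=True)
        sections[title] = []
    elif title is not None:
        sections[title].append(tag.get_text(strip=True))

for title, paragraphs in sections.items():
    print(title)
    print("-" * 120)
    print("\n".join(paragraphs))
    print()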

Difficulty using beautifulsoup in Python to scrape web data from multiple HTML classes

I am using Beautiful Soup in Python to scrape some data from a property listings site.
I have had success in scraping the individual elements that I require but wish to use a more efficient script to pull back all the data in one command if possible.
The difficulty is that the various elements I require reside in different classes.
I have tried the following, so far.
for listing in content.findAll('h2', attrs={"class": "listing-results-attr"}):
    print(listing.text)
which successfully gives the following list
15 room mansion for sale
3 bed barn conversion for sale
2 room duplex for sale
1 bed garden shed for sale
Separately, to retrieve the address details for each listing I have used the following successfully:
for address in content.findAll('a', attrs={"class": "listing-results-address"}):
    print(address.text)
which gives this
22 Acacia Avenue, CityName Postcode
100 Sleepy Hollow, CityName Postcode
742 Evergreen Terrace, CityName Postcode
31 Spooner Street, CityName Postcode
And for property price I have used this...
for prop_price in content.findAll('a', attrs={"class": "listing-results-price"}):
    print(prop_price.text)
which gives...
$350,000
$1,250,000
$750,000
$100,000
This is great; however, I need to be able to pull back all of this information more efficiently, so that all the data comes back in one pass.
At present I can do this using something like the code below:
all = content.select("a.listing-results-attr, h2.listing-results-address, a.listing-results-price")
This works somewhat, but it brings back too many additional HTML tags and is not nearly as elegant as I require. Results as follows:
</a>, <h2 class="listing-results-attr">
15 room mansion for sale
</h2>, <a class="listing-results-address" href="redacted">22 Acacia Avenue, CityName Postcode</a>, <a class="listing-results-price" href="redacted">
$350,000
Expected results should look something like this:
15 room mansion for sale
22 Acacia Avenue, CityName Postcode
$350,000
3 bed barn conversion for sale
100 Sleepy Hollow, CityName Postcode
$1,250,000
etc
etc
I then need to be able to store the results as JSON objects for later analysis.
Thanks in advance.
Change your selectors as shown below:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.zoopla.co.uk/for-sale/property/caerphilly/?q=Caerphilly&results_sort=newest_listings&search_source=home'
r = requests.get(url)
soup = bs(r.content, 'lxml')
details = ([item.text.strip() for item in soup.select(".listing-results-attr a, .listing-results-address , .text-price")])
You can view separately with, for example,
prices = details[0::3]
descriptions = details[1::3]
addresses = details[2::3]
print(prices, descriptions, addresses)
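Since the question also asks for JSON output, a small follow-on sketch that zips those slices into one dict per listing (this assumes each listing really does contribute exactly one price, description and address, in that repeating order):

import json

listings = [
    {"description": d, "address": a, "price": p}
    for d, a, p in zip(descriptions, addresses, prices)
]
print(json.dumps(listings, indent=2))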
The find_all() function always returns a list; strip() removes whitespace at the beginning and at the end of a string.
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.zoopla.co.uk/for-sale/property/caerphilly/?q=Caerphilly&results_sort=newest_listings&search_source=home'
r = requests.get(url)
soup = bs(r.content, 'lxml')

results = soup.find("ul", {'class': "listing-results clearfix js-gtm-list"})
for li in results.find_all("li", {'class': "srp clearfix"}):
    price = li.find("a", {"class": "listing-results-price text-price"}).text.strip()
    address = li.find("a", {'class': "listing-results-address"}).text.strip()
    description = li.find("h2", {'class': "listing-results-attr"}).find('a').text.strip()
    print(description)
    print(address)
    print(price)
Output:
2 bed detached bungalow for sale
Bronrhiw Fach, Caerphilly CF83
£159,950
2 bed semi-detached house for sale
Cwrt Nant Y Felin, Caerphilly CF83
£159,950
3 bed semi-detached house for sale
Pen-Y-Bryn, Caerphilly CF83
£102,950
.....

Can't scrape particular items from some elements

What do you do when there is no container or group to select and loop through in order to parse the required items (which are common to each group)? I want to parse the text, date and author from the pasted elements. The three results I am after do not belong to any particular group or container, so I can't find the right way to get them with a loop.
Here are the elements:
html = '''
<div class="view-content">
<p class="text-large experts-more-h">
We Have No Idea if Universal Preschool Actually Helps Kids
</p>
<p class="text-sans">
By David J. Armor. Washington Post. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2014-10-21T09:34:00-04:00">October 21, 2014</span>.
</p>
<p class="text-large experts-more-h">
At Last, Parent Resistance to Collective Standardized Tests
</p>
<p class="text-sans">
By Nat Hentoff. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="2014-01-15T09:57:00-05:00">January 15, 2014</span>.
</p>
<p class="text-sans">
By Darcy Ann Olsen and Eric Olsen. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="1999-04-15T00:00:00-04:00">April 15, 1999</span>.
</p>
<p class="text-large experts-more-h">
Day Care: Parents versus Professional Advocates
</p>
<p class="text-sans">
By Darcy Ann Olsen. Cato.org. <span class="date-display-single" property="dc:date" datatype="xsd:dateTime" content="1998-06-01T00:00:00-04:00">June 1, 1998</span>.
</p>
</div>
'''
If you run my script, you can see that it scrapes only the first result:
from lxml.html import fromstring
tree = fromstring(html)
post= tree.cssselect(".text-large a")[0].text
date = tree.cssselect(".date-display-single")[0].text
author = tree.cssselect(".text-sans")[0].text.strip()
print(post+'\n', date+'\n', author)
Result:
We Have No Idea if Universal Preschool Actually Helps Kids
October 21, 2014
By David J. Armor. Washington Post.
If you run this one, you will see that this script is able to parse all the results I'm after:
from lxml.html import fromstring
tree = fromstring(html)
count = tree.cssselect(".text-large a")
for item in range(len(count)):
    post = tree.cssselect(".text-large a")[item].text
    date = tree.cssselect(".date-display-single")[item].text
    author = tree.cssselect(".text-sans")[item].text.strip()
    print(post+'\n', date+'\n', author)
Results:
We Have No Idea if Universal Preschool Actually Helps Kids
October 21, 2014
By David J. Armor. Washington Post.
At Last, Parent Resistance to Collective Standardized Tests
January 15, 2014
By Nat Hentoff. Cato.org.
Day Care: Parents versus Professional Advocates
April 15, 1999
By Darcy Ann Olsen and Eric Olsen. Cato.org.
However, what I did in my second script is not at all Pythonic, and it will give wrong results if any data is missing. So, how do I select a group or container, loop through it, and parse all of the items? Thanks in advance.
If one of the text nodes (post, date, author) is missing, tree.cssselect(selector)[index].text will return a NoneType object, which you cannot handle as a string. To avoid this you can write
post = tree.cssselect(".text-large a")[item].text or " "
You can also try the below XPath solution:
container = tree.cssselect(".text-large")
for item in container:
    post = item.xpath('./a')[0].text or " "
    date = item.xpath('./following-sibling::p/span[@class="date-display-single"]')[0].text or " "
    author = (item.xpath('./following-sibling::p[@class="text-sans"]')[0].text or " ").strip()
    print(post+'\n', date+'\n', author)
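Since the rest of this page leans on BeautifulSoup, the same nearest-sibling pairing can also be sketched with it (html is the snippet from the question):

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
for title in soup.select("p.text-large"):
    # Pair each headline with the nearest following byline paragraph,
    # so an orphaned <p class="text-sans"> cannot shift the pairings
    meta = title.find_next_sibling("p", class_="text-sans")
    post = title.get_text(strip=True)
    author = meta.find(string=True).strip() if meta else ""
    date_span = meta.find("span", class_="date-display-single") if meta else None
    date = date_span.get_text() if date_span else ""
    print(post + '\n', date + '\n', author)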

Trouble with scraping <br> tag and datalist with links in it

This is an example of the HTML I'm scraping with Python/Beautifulsoup:
<dl>
<dd>
<strong>
<a name="45790" href="http://www.eslcafe.com/jobs/china/index.cgi?read=45790">Monthly 18000rmb ESL teachers for Shanghai Webi centers</a>
</strong>
<br>
Webi English Shanghai -- Tuesday, 7 March 2017, at 2:17 p.m.
</dd>
<dd></dd>
<dd></dd>
<dd></dd>
</dl>
I am able to scrape the <a href> but I haven't been able to get the text after the <br> despite running different loops.
This is my program:
import bs4 as bs
import urllib.request

sauce = urllib.request.urlopen('http://www.eslcafe.com/jobs/china/').read()
soup = bs.BeautifulSoup(sauce, 'html.parser')

dl = soup.dl
ads = []
for words in dl.find_all('a'):
    links = words.get('href')
    link_text = words.text
    link_text = link_text.lower()
    if 'university' in link_text:
        ads.append([links, link_text])
    if 'universities' in link_text:
        ads.append([links, link_text])
    if 'college' in link_text:
        ads.append([links, link_text])
    if 'colleges' in link_text:
        ads.append([links, link_text])

for ad in ads:
    for job in ad:
        print(job)
    print("")
There is also a problem with duplicates being added to the list if the text contains more than one of my search terms, but I can deal with that later.
I'm thinking I want to have a list containing lists that hold the link, link_text, and date_text.
ads = [[link, link_text, date_text], [link, link_text, date_text]]
Right now, I can only get the link and link_text.
Any suggestions?
You could use contents:
import bs4
soup = bs4.BeautifulSoup('<dl> .... </dl>') # your markup
print(soup.br.contents[0])
gives:
Webi English Shanghai -- Tuesday, 7 March 2017, at 2:17 p.m.
In [31]: for dd in soup.find_all('dd'):
    ...:     link = dd.a.get('href')
    ...:     link_text = dd.a.text
    ...:     *_, dd_text = dd.stripped_strings
out:
http://www.eslcafe.com/jobs/china/index.cgi?read=45391
Teach English in Shenyang, China: Great salary, Support, and Structured program
Greenheart Travel -- Thursday, 9 February 2017, at 1:05 p.m.
dd_text is the last text node of the dd tag, so I use *_ to capture all the text nodes before it.
EDIT:
In [20]: for dd in soup.find_all('dd'):
    ...:     d = {}  # store data in a dict
    ...:     d['link'] = dd.a.get('href')
    ...:     d['link_text'] = dd.a.text
    ...:     *_, dd_text = dd.stripped_strings
    ...:     d['date_text'] = dd_text
    ...:     print(d)
out:
{'date_text': 'EnglishTeacherChina.com -- Sunday, 12 February 2017, at 1:45 '
'p.m.',
'link': 'http://www.eslcafe.com/jobs/china/index.cgi?read=45426',
'link_text': '❤ ❤ ❤ Teach English In China 12,000-20,000 RMB/month - Adults '
'or Kids - Free Housing & Airfare - Free TEFL TESOL '
'Certification - Where You Want - YOUR NEEDS ARE OUR TOP '
'PRIORITY ❤ ❤ ❤'}
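Putting the answer together with the question's keyword filter, a minimal sketch (assuming the <dl>/<dd> structure shown above) that builds the [link, link_text, date_text] lists and also avoids the duplicate appends:

import bs4 as bs
import urllib.request

KEYWORDS = ('university', 'college')  # substring match also covers the plurals

sauce = urllib.request.urlopen('http://www.eslcafe.com/jobs/china/').read()
soup = bs.BeautifulSoup(sauce, 'html.parser')

ads = []
for dd in soup.dl.find_all('dd'):
    if dd.a is None:  # skip the empty <dd></dd> placeholders
        continue
    link = dd.a.get('href')
    link_text = dd.a.text.lower()
    *_, date_text = dd.stripped_strings  # the date line is the last text node
    # any() appends each ad at most once, fixing the duplicates problem
    if any(word in link_text for word in KEYWORDS):
        ads.append([link, link_text, date_text])

for ad in ads:
    print(*ad, sep='\n')
    print()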
