Python BeautifulSoup Wikipedia web scraping - learning Python

I am learning Python and BeautifulSoup, and I am trying to do some web scraping.
Let me first describe what I am trying to do.
On the wiki page https://en.m.wikipedia.org/wiki/List_of_largest_banks I am trying to print out this element:
<span class="mw-headline" id="By_market_capitalization" tabindex="0" role="button" aria-controls="content-collapsible-block-1" aria-expanded="true">By market capitalization</span>
I want to print out the text: By market capitalization
Then the text of the table of banks, for example:
By market capitalization
Rank  Bank           Cap Rate
1     JP Morgan      466.1
2     Bank of China  300
...all the way to 50
My code starts out like this:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
# text = soup.find('span', class_='mw-headline', id='By_market_capitalization').text
Ak_soup = soup.find_all('section', class_='mf-section-2 collapsible-block open-block', id='content-collapsible-block-1')
print(Ak_soup)
I believe my problem is more on the HTML side of things, but I am completely lost.
I inspected the element, and the tag I believe I should look for is:
<section class="mf-section-2 collapsible-block open-block">

Close to your goal - find the heading and then its next table, and transform it via pandas.read_html() into a DataFrame:
header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(str(header.find_next('table')))[0]
or, matching the table directly by a string it contains:
pd.read_html(html_text, match='Market cap')[0]
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
header = soup.select_one('h2:has(>#By_market_capitalization)')
print(header.span.text)
print(pd.read_html(str(header.find_next('table')))[0].to_markdown(index=False))
Output
By market capitalization
| Rank | Bank name                                | Market cap (US$ billion) |
|-----:|:-----------------------------------------|-------------------------:|
|    1 | JPMorgan Chase                            |               466.21[5] |
|    2 | Industrial and Commercial Bank of China   |                  295.65 |
|    3 | Bank of America                           |                  279.73 |
|    4 | Wells Fargo                               |                  214.34 |
|    5 | China Construction Bank                   |                  207.98 |
|    6 | Agricultural Bank of China                |                  181.49 |
|    7 | HSBC Holdings PLC                         |                  169.47 |
|    8 | Citigroup Inc.                            |                  163.58 |
|    9 | Bank of China                             |                  151.15 |
|   10 | China Merchants Bank                      |                  133.37 |
|   11 | Royal Bank of Canada                      |                  113.80 |
|   12 | Toronto-Dominion Bank                     |                  106.61 |
...

As you already know the desired header, you can just print it directly. Then with pandas, you can use a unique search term from the target table as a more direct selection method:
import pandas as pd
df = pd.read_html('https://en.m.wikipedia.org/wiki/List_of_largest_banks', match = 'Market cap')[0].reset_index(level = 0, drop = True)
print('By market capitalization')
print()
print(df.to_markdown(index = False))
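
If you would rather not hardcode the heading text, the select_one() approach from the first answer can supply it. A combined sketch using the same URL and selectors as above:

import pandas as pd
import requests
from bs4 import BeautifulSoup

html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')

# read the heading text from the page instead of hardcoding it
header = soup.select_one('h2:has(>#By_market_capitalization)')
print(header.span.text)
print()

# match= picks the table containing the given string
df = pd.read_html(html_text, match='Market cap')[0]
print(df.to_markdown(index=False))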

How to loop 'find next sibling' until reaching a certain tag when web scraping with BeautifulSoup in Python?

The webpage I'm attempting to scrape has a section where the HTML tags are nested like so:
<div>
  <h3>
  <p>
  <p>
  <h3>
  <p>
  <p>
  <p>
My code is able to navigate to the correct tag, but I am struggling to split the text by h3, as each h3 is a sibling of the p tags rather than a parent. I am either able to print just the tags or print all the text within the tag without splitting it into sections.
I've tried using for loops, but I don't think that's the right approach when searching within siblings. I think looping with an if statement to check whether find_next_sibling().name == 'h3' might work, but I've been unable to iterate this without nesting a large number of if statements.
Can anyone please advise on what approach I should take? Please see my full code below - the treaty files section works fine.
from bs4 import BeautifulSoup
import requests

url = 'https://www.gov.uk/government/publications/albania-tax-treaties'
get_url = requests.get(url)
url_html = get_url.content
soup = BeautifulSoup(url_html, 'lxml')

treaty_files = soup.find_all('div', class_='attachment-details')
for treaty_file in treaty_files:
    file_name = treaty_file.h3.a.text
    file_url = treaty_file.h3.a['href']
    #print(f"Treaty Name: {file_name}")
    #print(f"Treaty URL: {file_url}")
    #print()

#Attempt 1
treaty_details = soup.find('div', class_='govspeak').find_all('h3')
for treaty_content in treaty_details:
    content = treaty_content.find_next_siblings()
    for x in content:
        test = x
    a = test
    #print(a)

#Attempt 2
treaty_details = soup.find('div', class_='govspeak').find_all('h3')
for treaty_content in treaty_details:
    content = treaty_content.find_next_sibling()
    while content.name != 'h3':
        print(f"Text: {content.text}")
        content = content.find_next_sibling()
        if content.name == 'h3':
            break
One possible solution is to leverage the pandas.Series.groupby function to group the sections together:
import requests
import pandas as pd
from bs4 import BeautifulSoup

url = "https://www.gov.uk/government/publications/albania-tax-treaties"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

govspeak = soup.select_one(".govspeak")

s = pd.Series(govspeak.find_all(recursive=False))
for _, g in s.groupby(s.apply(lambda x: x.name).eq("h3").cumsum()):
    title = g.iloc[0].text
    text = "\n".join(row.text for row in g.iloc[1:])
    print(title)
    print("-" * 120)
    print(text)
    print()
    print()
Prints:
2021 UK-Albania Synthesised text of the Multilateral Instrument and the 2013 Double Taxation Agreement — in force
------------------------------------------------------------------------------------------------------------------------
The 2013 UK-Albania Double Taxation Agreement has been modified by the Multilateral Instrument (MLI).
The modifications made by the Multilateral Instrument entered into force in:
the UK on 1 October 2018
Albania on 1 January 2021
They are effective in the UK from:
1 January 2021 for taxes withheld at source
1 April 2022 for Corporation Tax
6 April 2022 for Income Tax and Capital Gains Tax
They are effective in Albania from 1 July 2021.
2013 UK-Albania Double Taxation Agreement — in force
------------------------------------------------------------------------------------------------------------------------
The agreement entered into force on 30 December 2013.
It is effective in the UK from:
1 April 2014 for Corporation Tax
6 April 2014 for Income Tax and Capital Gains Tax
It is effective in Albania from 1 January 2014 for Income Tax and Capital Gains Tax.
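Another possible solution stays closer to the original idea of walking siblings until the next h3. A minimal sketch using find_next_siblings(), assuming the same .govspeak structure:

import requests
from bs4 import BeautifulSoup

url = "https://www.gov.uk/government/publications/albania-tax-treaties"
soup = BeautifulSoup(requests.get(url).content, "html.parser")

for h3 in soup.select_one(".govspeak").find_all("h3"):
    print(h3.text)
    print("-" * 120)
    # collect siblings until the next h3 ends this section
    for sib in h3.find_next_siblings():
        if sib.name == "h3":
            break
        print(sib.text)
    print()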

I have 3 lists whose items I want to print in order, in an enumerated way

import requests
from bs4 import BeautifulSoup

URL = "https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

weak = soup.find(id="SearchResults")
jobname = weak.find_all(class_="summary")
jobnamelists = []
companyname = weak.find_all(class_="company")
companynamelists = []
locations = weak.find_all(class_="location")
locationlist = []

for job in jobname:
    jobnamelists.append(job.find(class_="title").get_text())
for company in companyname:
    companynamelists.append(company.find(class_="name").get_text())
for location in locations:
    locationlist.append(location.find(class_="name").get_text())
This is the code. In the end it gives me 3 separate lists that I scraped from the web. Now I want them to be printed in an enumerated way, so that the first job is printed with the first company and the first location, one by one. Can anyone help me with that?
As stated in the comments, use the zip() function to iterate over the three lists together. For example:
import requests
from bs4 import BeautifulSoup

URL = "https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")

for j, c, l in zip(soup.select('#SearchResults .summary .title'),
                   soup.select('#SearchResults .company .name'),
                   soup.select('#SearchResults .location .name')):
    print(j.get_text(strip=True))
    print(c.get_text(strip=True))
    print(l.get_text(strip=True))
    print('-' * 80)
Prints:
Resident Engineer (Software) Cyber Security - Sydney
Varmour
Sydney, NSW
--------------------------------------------------------------------------------
Senior/Lead Software Engineer, Browser
Magic Leap, Inc.
Sunnyvale, CA; Plantation, FL (HQ); Austin, TX; Culver New York City, CA; Seattle, WA; Toronto, NY
--------------------------------------------------------------------------------
Service Consultant REST
TAL
Sydney, NSW
--------------------------------------------------------------------------------
...and so on.
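Since the question asks for enumerated output, the zip() call can also be wrapped in enumerate() to number each job/company/location triple. A small variation on the loop above:

import requests
from bs4 import BeautifulSoup

URL = "https://www.monster.com/jobs/search/?q=Software-Developer&where=Australia"
soup = BeautifulSoup(requests.get(URL).content, "html.parser")

triples = zip(soup.select('#SearchResults .summary .title'),
              soup.select('#SearchResults .company .name'),
              soup.select('#SearchResults .location .name'))
# enumerate(..., start=1) numbers the results from 1
for i, (job, company, location) in enumerate(triples, start=1):
    print(f"{i}. {job.get_text(strip=True)} | "
          f"{company.get_text(strip=True)} | {location.get_text(strip=True)}")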

REGEX extract information from EDGAR SC-13 form

I am trying to extract information from the latest SEC EDGAR Schedule 13 form filings.
Link to an example filing:
1) Saba Capital_27-Dec-2019_SC13
The information I am trying to extract (and the parts of the filing containing it):
1) Names of reporting persons: Saba Capital Management, L.P.
<p style="margin-bottom: 0pt;">NAME OF REPORTING PERSON</p>
<p style="margin-top: 0pt; margin-left: 18pt;">Saba Capital Management GP, LLC<br><br/>
2) Name of issuer : WESTERN ASSET HIGH INCOME FUND II INC
<p style="text-align: center;"><b><font size="5"><u>WESTERN ASSET HIGH INCOME FUND II INC.</u></font><u><br/></u>(Name of Issuer)</b>
3) CUSIP Number: 95766J102 (managed to get)
<p style="text-align: center;"><b><u>95766J102<br/></u>(CUSIP Number)</b>
4) Percentage of class represented by amount : 11.3% (managed to get)
<p style="margin-bottom: 0pt;">PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW (11)</p>
<p style="margin-top: 0pt; margin-left: 18pt;">11.3%<br><br/>
5) Date of Event Which requires filing of this statement: December 24, 2019
<p style="text-align: center;"><b><u>December 24, 2019<br/></u>(Date of Event Which Requires Filing of This Statement)</b>
import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'xml')
## get CUSIP number
CUSIP = re.findall(r'[0-9]{3}[a-zA-Z0-9]{2}[a-zA-Z0-9*##]{3}[0-9]', soup.text)
### get %
regex = r"(?<=PERCENT OF CLASS|Percent of class)(.*)(?=%)"
percent = re.findall(r'\d+.\d+', re.search(regex, soup.text, re.DOTALL).group().split('%')[0])
How can I extract the 5 pieces of information from the filing? Thanks in advance
Try the following code to get all the values, using find() and the CSS selector method select_one():
import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'lxml')
NameReportingperson=soup.find('p', text=re.compile('NAME OF REPORTING PERSON')).find_next('p').text.strip()
print(NameReportingperson)
NameOftheIssuer=soup.select_one('p:nth-child(7) > b u').text.strip()
print(NameOftheIssuer)
CUSIP=soup.select_one("p:nth-child(9) > b > u").text.strip()
print(CUSIP)
percentage=soup.find('p', text=re.compile('PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW')).find_next('p').text.strip()
print(percentage)
Dateof=soup.select_one("p:nth-child(11) > b > u").text.strip()
print(Dateof)
Output:
Saba Capital Management, L.P.
WESTERN ASSET HIGH INCOME FUND II INC.
95766J102
11.3%
December 24, 2019
UPDATED
If you don't want to rely on element positions, try the version below.
import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'lxml')
NameReportingperson=soup.find('p', text=re.compile('NAME OF REPORTING PERSON')).find_next('p').text.strip()
print(NameReportingperson)
NameOftheIssuer=soup.select_one('p:contains(Issuer)').find_next('u').text.strip()
print(NameOftheIssuer)
CUSIP=soup.select_one('p:contains(CUSIP)').find_next('u').text.strip()
print(CUSIP)
percentage=soup.find('p', text=re.compile('PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW')).find_next('p').text.strip()
print(percentage)
Dateof=soup.select_one('p:contains(Event)').find_next('u').text.strip()
print(Dateof)
Output:
Saba Capital Management, L.P.
WESTERN ASSET HIGH INCOME FUND II INC.
95766J102
11.3%
December 24, 2019
Update 2:
import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'lxml')
NameReportingperson=soup.find('p', text=re.compile('NAME OF REPORTING PERSON')).find_next('p').text.strip()
print(NameReportingperson)
NameOftheIssuer=soup.select_one('p:nth-of-type(7) > b u').text.strip()
print(NameOftheIssuer)
CUSIP=soup.select_one("p:nth-of-type(9) > b > u").text.strip()
print(CUSIP)
percentage=soup.find('p', text=re.compile('PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW')).find_next('p').text.strip()
print(percentage)
Dateof=soup.select_one("p:nth-of-type(11) > b > u").text.strip()
print(Dateof)
Using lxml, it should work this way:
import requests
import lxml.html
url = 'https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm'
source = requests.get(url)
doc = lxml.html.fromstring(source.text)
name = doc.xpath('//p[text()="NAME OF REPORTING PERSON"]/following-sibling::p/text()')[0]
issuer = doc.xpath('//p[contains(text(),"(Name of Issuer)")]//u/text()')[0]
cusip = doc.xpath('//p[contains(text(),"(CUSIP Number)")]//u/text()')[0]
perc = doc.xpath('//p[contains(text(),"PERCENT OF CLASS REPRESENTED")]/following-sibling::p/text()')[0]
event = doc.xpath('//p[contains(text(),"(Date of Event Which Requires")]//u/text()')[0]
Output:
Saba Capital Management, L.P.
WESTERN ASSET HIGH INCOME FUND II INC.
95766J102
11.3%
December 24, 2019
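If the values are needed for further processing rather than just printing, they can be collected into a dict. A minimal sketch reusing the lxml variables above (the key names are illustrative, not taken from the filing):

import json

filing = {
    'reporting_person': name.strip(),
    'issuer': issuer.strip(),
    'cusip': cusip.strip(),
    'percent_of_class': perc.strip(),
    'event_date': event.strip(),
}
print(json.dumps(filing, indent=2))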

Get Text from h1 with BeautifulSoup

I was asked to get a product name from a web page; specifically, this text:
SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K
This is my BeautifulSoup code:
import requests
from bs4 import BeautifulSoup
get = requests.get('https://www.ebay.com/itm/SEIKO-5-AUTOMATIC-MENS-STEEL-VINTAGE-JAPAN-MADE-BLACK-DIAL-WATCH-RUN-ORDER-K/143420840058?epid=18032713872&_trkparms=ispr%3D1&hash=item21648c587a:g:ZzEAAOSw9MRdsI8v&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8')
soup = BeautifulSoup(get.text, 'lxml')
company = soup.select('h1.it-ttl')[0].text.strip()
print(company)
The relevant HTML is:
<h1 class="it-ttl" id="itemTitle" itemprop="name">
<span class="g-hdn">Details about
</span>
SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K
</h1>
Instead of the desired text, I get this:
Details about SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K
How can I extract only the product name?
import requests
from bs4 import BeautifulSoup
get = requests.get('https://www.ebay.com/itm/SEIKO-5-AUTOMATIC-MENS-STEEL-VINTAGE-JAPAN-MADE-BLACK-DIAL-WATCH-RUN-ORDER-K/143420840058?epid=18032713872&_trkparms=ispr%3D1&hash=item21648c587a:g:ZzEAAOSw9MRdsI8v&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8')
soup = BeautifulSoup(get.text, 'html.parser')
company = soup.select('h1.it-ttl')[0].text.strip()
span_text = soup.select('span.g-hdn')[0].text.strip()
print(company)
print(span_text)
print(company.replace(span_text, '', 1).strip())
Since the span tag is nested in the h1 tag, one approach is to extract the span text and remove that prefix from the h1 text. (Note that str.lstrip(span_text) would treat its argument as a set of characters to strip rather than a prefix, so str.replace is safer here.)
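An alternative that avoids string handling entirely is to remove the span element from the parse tree before reading the text, using extract(). A self-contained sketch on the h1 snippet shown above:

from bs4 import BeautifulSoup

# the <h1> structure from the question
html = '''<h1 class="it-ttl" id="itemTitle" itemprop="name">
<span class="g-hdn">Details about</span>
SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K
</h1>'''

soup = BeautifulSoup(html, 'html.parser')
h1 = soup.select_one('h1.it-ttl')
h1.select_one('span.g-hdn').extract()  # drop the "Details about" span
print(h1.text.strip())
# SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K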

Difficulty using BeautifulSoup in Python to scrape web data from multiple HTML classes

I am using Beautiful Soup in Python to scrape some data from a property listings site.
I have had success in scraping the individual elements that I require but wish to use a more efficient script to pull back all the data in one command if possible.
The difficulty is that the various elements I require reside in different classes.
I have tried the following, so far.
for listing in content.findAll('h2', attrs={"class": "listing-results-attr"}):
    print(listing.text)
which successfully gives the following list
15 room mansion for sale
3 bed barn conversion for sale
2 room duplex for sale
1 bed garden shed for sale
Separately, to retrieve the address details for each listing I have used the following successfully;
for address in content.findAll('a', attrs={"class": "listing-results-address"}):
    print(address.text)
which gives this
22 Acacia Avenue, CityName Postcode
100 Sleepy Hollow, CityName Postcode
742 Evergreen Terrace, CityName Postcode
31 Spooner Street, CityName Postcode
And for property price I have used this...
for prop_price in content.findAll('a', attrs={"class": "listing-results-price"}):
    print(prop_price.text)
which gives...
$350,000
$1,250,000
$750,000
$100,000
This is great; however, I need to be able to pull back all of this information in a more efficient and performant way, such that all the data comes back in one pass.
At present I can do this using something like the code below:
all = content.select("a.listing-results-attr, h2.listing-results-address, a.listing-results-price")
This works somewhat, but it brings back too many additional HTML tags and is just not nearly as elegant or sophisticated as I require. Results as follows:
</a>, <h2 class="listing-results-attr">
15 room mansion for sale
</h2>, <a class="listing-results-address" href="redacted">22 Acacia Avenue, CityName Postcode</a>, <a class="listing-results-price" href="redacted">
$350,000
Expected results should look something like this:
15 room mansion for sale
22 Acacia Avenue, CityName Postcode
$350,000
3 bed barn conversion for sale
100 Sleepy Hollow, CityName Postcode
$1,250,000
etc
etc
I then need to be able to store the results as JSON objects for later analysis.
Thanks in advance.
Change your selectors as shown below:
import requests
from bs4 import BeautifulSoup as bs
url = 'https://www.zoopla.co.uk/for-sale/property/caerphilly/?q=Caerphilly&results_sort=newest_listings&search_source=home'
r = requests.get(url)
soup = bs(r.content, 'lxml')
details = ([item.text.strip() for item in soup.select(".listing-results-attr a, .listing-results-address , .text-price")])
You can view them separately with, for example:
prices = details[0::3]
descriptions = details[1::3]
addresses = details[2::3]
print(prices, descriptions, addresses)
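The question also asks to store the results as JSON objects for later analysis. Building on the prices, descriptions and addresses slices above, a minimal sketch (the key names are illustrative):

import json

listings = [{'description': d, 'address': a, 'price': p}
            for d, a, p in zip(descriptions, addresses, prices)]
print(json.dumps(listings, indent=2))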
The find_all() function always returns a list; strip() removes whitespace from the beginning and end of a string.
import requests
from bs4 import BeautifulSoup as bs

url = 'https://www.zoopla.co.uk/for-sale/property/caerphilly/?q=Caerphilly&results_sort=newest_listings&search_source=home'
r = requests.get(url)
soup = bs(r.content, 'lxml')

results = soup.find("ul", {'class': "listing-results clearfix js-gtm-list"})
for li in results.find_all("li", {'class': "srp clearfix"}):
    price = li.find("a", {"class": "listing-results-price text-price"}).text.strip()
    address = li.find("a", {'class': "listing-results-address"}).text.strip()
    description = li.find("h2", {'class': "listing-results-attr"}).find('a').text.strip()
    print(description)
    print(address)
    print(price)
Output:
2 bed detached bungalow for sale
Bronrhiw Fach, Caerphilly CF83
£159,950
2 bed semi-detached house for sale
Cwrt Nant Y Felin, Caerphilly CF83
£159,950
3 bed semi-detached house for sale
Pen-Y-Bryn, Caerphilly CF83
£102,950
.....
