REGEX extract information from EDGAR SC-13 form - python

I am trying to extract information from the latest SEC EDGAR Schedule 13 form filings.
Link to an example filing:
1) Saba Capital_27-Dec-2019_SC13
The information I am trying to extract (and the parts of the filing containing it):
1) Names of reporting persons: Saba Capital Management, L.P.
<p style="margin-bottom: 0pt;">NAME OF REPORTING PERSON</p>
<p style="margin-top: 0pt; margin-left: 18pt;">Saba Capital Management GP, LLC<br><br/>
2) Name of issuer : WESTERN ASSET HIGH INCOME FUND II INC
<p style="text-align: center;"><b><font size="5"><u>WESTERN ASSET HIGH INCOME FUND II INC.</u></font><u><br/></u>(Name of Issuer)</b>
3) CUSIP Number: 95766J102 (managed to get)
<p style="text-align: center;"><b><u>95766J102<br/></u>(CUSIP Number)</b>
4) Percentage of class represented by amount : 11.3% (managed to get)
<p style="margin-bottom: 0pt;">PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW (11)</p>
<p style="margin-top: 0pt; margin-left: 18pt;">11.3%<br><br/>
5) Date of Event Which requires filing of this statement: December 24, 2019
<p style="text-align: center;"><b><u>December 24, 2019<br/></u>(Date of Event Which Requires Filing of This Statement)</b>
My code so far:
import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'xml')
## get CUSIP number
CUSIP = re.findall(r'[0-9]{3}[a-zA-Z0-9]{2}[a-zA-Z0-9*##]{3}[0-9]', soup.text)
### get %
regex = r"(?<=PERCENT OF CLASS|Percent of class)(.*)(?=%)"
percent = re.findall(r'\d+.\d+', re.search(regex, soup.text, re.DOTALL).group().split('%')[0])
How can I extract the 5 pieces of information from the filing? Thanks in advance

Try the following code to get all the values, using find() and the CSS selector method select_one().
import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'lxml')
NameReportingperson=soup.find('p', text=re.compile('NAME OF REPORTING PERSON')).find_next('p').text.strip()
print(NameReportingperson)
NameOftheIssuer=soup.select_one('p:nth-child(7) > b u').text.strip()
print(NameOftheIssuer)
CUSIP=soup.select_one("p:nth-child(9) > b > u").text.strip()
print(CUSIP)
percentage=soup.find('p', text=re.compile('PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW')).find_next('p').text.strip()
print(percentage)
Dateof=soup.select_one("p:nth-child(11) > b > u").text.strip()
print(Dateof)
Output:
Saba Capital Management, L.P.
WESTERN ASSET HIGH INCOME FUND II INC.
95766J102
11.3%
December 24, 2019
UPDATED
If you don't want to rely on positional selectors, try the version below.
import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'lxml')
NameReportingperson=soup.find('p', text=re.compile('NAME OF REPORTING PERSON')).find_next('p').text.strip()
print(NameReportingperson)
NameOftheIssuer=soup.select_one('p:contains(Issuer)').find_next('u').text.strip()
print(NameOftheIssuer)
CUSIP=soup.select_one('p:contains(CUSIP)').find_next('u').text.strip()
print(CUSIP)
percentage=soup.find('p', text=re.compile('PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW')).find_next('p').text.strip()
print(percentage)
Dateof=soup.select_one('p:contains(Event)').find_next('u').text.strip()
print(Dateof)
Output:
Saba Capital Management, L.P.
WESTERN ASSET HIGH INCOME FUND II INC.
95766J102
11.3%
December 24, 2019
Update 2:
import requests
import re
from bs4 import BeautifulSoup
page = requests.get('https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm')
soup = BeautifulSoup(page.text, 'lxml')
NameReportingperson=soup.find('p', text=re.compile('NAME OF REPORTING PERSON')).find_next('p').text.strip()
print(NameReportingperson)
NameOftheIssuer=soup.select_one('p:nth-of-type(7) > b u').text.strip()
print(NameOftheIssuer)
CUSIP=soup.select_one("p:nth-of-type(9) > b > u").text.strip()
print(CUSIP)
percentage=soup.find('p', text=re.compile('PERCENT OF CLASS REPRESENTED BY AMOUNT IN ROW')).find_next('p').text.strip()
print(percentage)
Dateof=soup.select_one("p:nth-of-type(11) > b > u").text.strip()
print(Dateof)
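(A side note on the find() calls above: newer BeautifulSoup releases prefer the string= keyword over the older text= alias, so if your bs4 version warns about text=, the same lookup can be written as below; the behavior is otherwise identical.)
# equivalent lookup with the newer keyword (bs4 >= 4.4)
NameReportingperson = soup.find(
    'p', string=re.compile('NAME OF REPORTING PERSON')
).find_next('p').text.strip()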

Using lxml, it should work this way:
import requests
import lxml.html
url = 'https://www.sec.gov/Archives/edgar/data/1058239/000106299319004848/formsc13da.htm'
source = requests.get(url)
doc = lxml.html.fromstring(source.text)
name = doc.xpath('//p[text()="NAME OF REPORTING PERSON"]/following-sibling::p/text()')[0]
issuer = doc.xpath('//p[contains(text(),"(Name of Issuer)")]//u/text()')[0]
cusip = doc.xpath('//p[contains(text(),"(CUSIP Number)")]//u/text()')[0]
perc = doc.xpath('//p[contains(text(),"PERCENT OF CLASS REPRESENTED")]/following-sibling::p/text()')[0]
event = doc.xpath('//p[contains(text(),"(Date of Event Which Requires")]//u/text()')[0]
print(name, issuer, cusip, perc, event, sep='\n')
Output:
Saba Capital Management, L.P.
WESTERN ASSET HIGH INCOME FUND II INC.
95766J102
11.3%
December 24, 2019
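One practical caveat for both approaches: SEC EDGAR may reject scripted requests that do not declare a User-Agent, so it can be necessary to send one explicitly (the contact string below is a placeholder, not from the original answers):
import requests
headers = {'User-Agent': 'Your Name yourname@example.com'}  # placeholder contact details
source = requests.get(url, headers=headers)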

Related

python BeautifulSoup Wikipedia Webscraping - learning

I am learning Python and BeautifulSoup, and I am trying to do some web scraping.
Let me first describe what I am trying to do.
the wiki page: https://en.m.wikipedia.org/wiki/List_of_largest_banks
I am trying to print out the text of this element:
<span class="mw-headline" id="By_market_capitalization" tabindex="0" role="button" aria-controls="content-collapsible-block-1" aria-expanded="true">By market capitalization</span>
I want to print out the text: By market capitalization
Then the text of the table of the banks. For example:
By market capitalization

| Rank | Bank          | Cap Rate |
|------|---------------|----------|
| 1    | JP Morgan     | 466.1    |
| 2    | Bank of China | 300      |

... all the way to 50
My code starts out like this:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
# text = soup.find('span', class_='mw-headline', id='By_market_capitalization').text
Ak_soup = soup.find_all('section', class_='mf-section-2 collapsible-block open-block', id='content-collapsible-block-1')
print(Ak_soup)
I believe my problem is more on the HTML side of things, but I am completely lost. I inspected the element, and the tags I believe I should look for are:
{section class_='mf-section-2 collapsible-block open-block'}
Close to your goal - find the heading and then its next table, and transform it via pandas.read_html() into a dataframe.
header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(str(header.find_next('table')))[0]
or
header = soup.select_one('h2:has(>#By_market_capitalization)')
pd.read_html(html_text, match='Market cap')[0]
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd
html_text = requests.get('https://en.wikipedia.org/wiki/List_of_largest_banks').text
soup = BeautifulSoup(html_text, 'lxml')
header = soup.select_one('h2:has(>#By_market_capitalization)')
print(header.span.text)
print(pd.read_html(str(header.find_next('table')))[0].to_markdown(index=False))
Output
By market capitalization

| Rank | Bank name                               | Market cap(US$ billion) |
|-----:|:----------------------------------------|------------------------:|
|    1 | JPMorgan Chase                          | 466.21[5]               |
|    2 | Industrial and Commercial Bank of China | 295.65                  |
|    3 | Bank of America                         | 279.73                  |
|    4 | Wells Fargo                             | 214.34                  |
|    5 | China Construction Bank                 | 207.98                  |
|    6 | Agricultural Bank of China              | 181.49                  |
|    7 | HSBC Holdings PLC                       | 169.47                  |
|    8 | Citigroup Inc.                          | 163.58                  |
|    9 | Bank of China                           | 151.15                  |
|   10 | China Merchants Bank                    | 133.37                  |
|   11 | Royal Bank of Canada                    | 113.80                  |
|   12 | Toronto-Dominion Bank                   | 106.61                  |
...
As you already know the desired header, you can just print it directly. Then, with pandas, you can use a unique search term from the target table as a more direct selection method:
import pandas as pd
df = pd.read_html('https://en.m.wikipedia.org/wiki/List_of_largest_banks', match = 'Market cap')[0].reset_index(level = 0, drop = True)
print('By market capitalization')
print()
print(df.to_markdown(index = False))
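One wrinkle in the scraped numbers: Wikipedia cells can carry citation markers such as 466.21[5], which keep the column from being numeric. A minimal cleanup sketch, assuming the market-cap column is the last one (the column choice is my assumption, not from the answer):
import pandas as pd
df = pd.read_html('https://en.wikipedia.org/wiki/List_of_largest_banks', match='Market cap')[0]
cap_col = df.columns[-1]  # assumption: market cap is the last column
df[cap_col] = (df[cap_col].astype(str)
               .str.replace(r'\[\d+\]', '', regex=True)  # drop citation markers like [5]
               .astype(float))
print(df.head())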

How to use loop 'find next sibling' until reaching a certain tag when web scraping with beautifulsoup in python?

The webpage I'm attempting to scrape has a section where the html tags are nested like so:
<div>
    <h3>
    <p>
    <p>
    <h3>
    <p>
    <p>
    <p>
My code is able to navigate to the correct tag, but I am struggling to split the text by the <h3> tags, as each <h3> is a sibling, not a parent, of the <p> tags that follow it. I am either able to print just the <h3> tags, or print all the text within the <div> without splitting it into sections.
I've tried using for loops, but I don't think that is the right approach when searching within siblings. I think a loop with an if statement to determine whether find_next_sibling().name == 'h3' might work, but I've been unable to iterate this without nesting a large number of if statements.
Can anyone please advise on what approach I should take? Please see my full code below - the treaty files section works fine.
from bs4 import BeautifulSoup
import requests
url = 'https://www.gov.uk/government/publications/albania-tax-treaties'
get_url = requests.get(url)
url_html = get_url.content
soup = BeautifulSoup(url_html, 'lxml')
treaty_files = soup.find_all('div', class_='attachment-details')
for treaty_file in treaty_files:
    file_name = treaty_file.h3.a.text
    file_url = treaty_file.h3.a['href']
    #print(f"Treaty Name: {file_name}")
    #print(f"Treaty URL: {file_url}")
    #print()

#Attempt 1
treaty_details = soup.find('div', class_='govspeak').find_all('h3')
for treaty_content in treaty_details:
    content = treaty_content.find_next_siblings()
    for x in content:
        test = x
        a = test
        #print(a)

#Attempt 2
treaty_details = soup.find('div', class_='govspeak').find_all('h3')
for treaty_content in treaty_details:
    content = treaty_content.find_next_sibling()
    while content.name != 'h3':
        print(f"Text: {content.text}")
        content = content.find_next_sibling()
        if content.name == 'h3':
            break
One possible solution is to leverage the pandas.Series.groupby function to group the sections together:
import requests
import pandas as pd
from bs4 import BeautifulSoup
url = "https://www.gov.uk/government/publications/albania-tax-treaties"
soup = BeautifulSoup(requests.get(url).content, "html.parser")
govspeak = soup.select_one(".govspeak")
s = pd.Series(govspeak.find_all(recursive=False))
for _, g in s.groupby(s.apply(lambda x: x.name).eq("h3").cumsum()):
    title = g.iloc[0].text
    text = "\n".join(row.text for row in g.iloc[1:])
    print(title)
    print("-" * 120)
    print(text)
    print()
    print()
Prints:
2021 UK-Albania Synthesised text of the Multilateral Instrument and the 2013 Double Taxation Agreement — in force
------------------------------------------------------------------------------------------------------------------------
The 2013 UK-Albania Double Taxation Agreement has been modified by the Multilateral Instrument (MLI).
The modifications made by the Multilateral Instrument entered into force in:
the UK on 1 October 2018
Albania on 1 January 2021
They are effective in the UK from:
1 January 2021 for taxes withheld at source
1 April 2022 for Corporation Tax
6 April 2022 for Income Tax and Capital Gains Tax
They are effective in Albania from 1 July 2021.
2013 UK-Albania Double Taxation Agreement — in force
------------------------------------------------------------------------------------------------------------------------
The agreement entered into force on 30 December 2013.
It is effective in the UK from:
1 April 2014 for Corporation Tax
6 April 2014 for Income Tax and Capital Gains Tax
It is effective in Albania from 1 January 2014 for Income Tax and Capital Gains Tax.
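If you prefer to stay in plain BeautifulSoup, the sibling loop sketched in the question works once it also stops when find_next_sibling() returns None (i.e. at the end of the container); otherwise it raises AttributeError after the last section. A minimal sketch:
import requests
from bs4 import BeautifulSoup
url = 'https://www.gov.uk/government/publications/albania-tax-treaties'
soup = BeautifulSoup(requests.get(url).content, 'lxml')
for heading in soup.find('div', class_='govspeak').find_all('h3'):
    print(heading.text)
    print('-' * 120)
    sibling = heading.find_next_sibling()
    # walk forward until the next <h3> or the end of the parent div
    while sibling is not None and sibling.name != 'h3':
        print(sibling.get_text(strip=True))
        sibling = sibling.find_next_sibling()
    print()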

Beautifulsoup4 - not selecting all instances of span class

I am attempting to scrape data from a website that uses non-specific span classes to format/display content. The pages present information about chemical products and each product is described within a single div class.
I first parsed by that div class and am working to pull the data I need from there. I have been able to get many things, but the parts I can't seem to pull are within the span class "ppisreportspan".
If you look at the code, you will note that it appears multiple times within each chemical description.
<tr>
<td><h4 id='stateprod'>MAINE STATE PRODUCT REPORT</h4><hr class='report'><span style="color:Maroon;" Class="subtitle">Company Number: </span><span style='color:black;' Class="subtitle">38</span><br /><span Class="subtitle">MONSANTO COMPANY <br/>800 N. LINDBERGH BOULEVARD <br/>MAIL STOP FF4B <br/>ST LOUIS MO 63167-0001<br/></span><br/><span style="color:Maroon;" Class="subtitle">Number of Currently Registered Products: </span><span style='color:black; font-size:14px' class="subtitle">80</span><br /><br/><p class='noprint'><img alt='' src='images/epalogo.png' /> View the label in the US EPA Pesticide Product Label System (PPLS).<br /><img alt='' src='images/alstar.png' /> View the label in the Accepted Labels State Tracking and Repository (ALSTAR).<br /></p>
<hr class='report'>
<div class='nopgbrk'>
<span class='ppisreportspanprodname'>PRECEPT INSECTICIDE </span>
<br/>EPA Registration Number: <a href = "http://iaspub.epa.gov/apex/pesticides/f?p=PPLS:102:::NO::P102_REG_NUM:100-1075" target='blank'>100-1075-524 <img alt='EPA PPLS Link' src='images/pplslink.png'/></a>
<span class='line-break'></span>
<span class=ppisProd>ME Product Number: </span>
<span class="ppisreportspan">2014000996</span>
<br />Registration Year: <span class="ppisreportspan">2019</span>
Type: <span class="ppisreportspan">RESTRICTED</span><br/><br/>
<table width='100%'>
<tr>
<td width='13%'>Percent</td>
<td style='width:87%;align:left'>Active Ingredient</td>
</tr>
<tr>
<td><span class="ppisreportspan">3.0000</span></td>
<td><span class="ppisreportspan">Tefluthrin (128912)</span></td>
</tr>
</table><hr />
</div>
<div class='nopgbrk'>
<span class='ppisreportspanprodname' >ACCELERON IC-609 INSECTICIDE SEED TREATMENT FOR CORN</span>
<br/>EPA Registration Number: <a href = "http://iaspub.epa.gov/apex/pesticides/f?p=PPLS:102:::NO::P102_REG_NUM:264-789" target='blank'>264-789-524 <img alt='EPA PPLS Link' src='images/pplslink.png'/>
</a><span class='line-break'></span>
<span class=ppisProd>ME Product Number: <a href = "alstar_label.aspx?LabelId=116671" target = 'blank'>2009005053</span>
<img alt='ALSTAR Link' src='images/alstar.png'/></a>
<br />Registration Year: <span class="ppisreportspan">2019</span>
<br/>
<table width='100%'>
<tr>
<td width='13%'>Percent</td>
<td style='width:87%;align:left'>Active Ingredient</td>
</tr>
<tr>
<td><span class="ppisreportspan">48.0000</span></td>
<td><span class="ppisreportspan">Clothianidin (44309)</span></td>
</tr>
</table><hr />
</div>
This sample includes two chemicals. One has an "alstar" ID and link and one does not. Both have registration years. Those are the data points that are hanging me up.
You may also note that there is a 10-digit code stored in "ppisreportspan" in the first example. I was able to extract that as part of the "ppisProd" span for any record that doesn't have the Alstar link. I don't understand why, but it reinforces the point that my parsing process seems to ignore that span class.
I have tried various methods over the last 2 days based on all kinds of different answers on SO, so I can't possibly list them all. I seem to be able either to get everything from the first span to the end of the last span, or to get "NoneType" errors or empty lists.
This one gets the closest: it returns the correct spans for many div chunks, but it still skips (returns an empty list []) any of the ones that have Alstar links, like the second one in the example.
[picture showing data, then a series of three sets of empty brackets where the data should be]
import urllib.request, urllib.parse, urllib.error
from bs4 import BeautifulSoup
import ssl
import re
url = input('Enter URL:')
hand = open(url)
soup = BeautifulSoup(hand, 'html.parser')
#create a list of chunks by product (div)
products = soup.find_all('div' , class_ ='nopgbrk')
print(type(products))
print(len(products))
tempalstars =[]
rptspanclasses = []
regyears = []
alstarIDs = []
asltrlinks = []
# read the span tags
for product in products:
    tempalstar = product.find_all('span', class_="ppisreportspan")
    tempalstars.append(tempalstar)
    print(tempalstar)
Ultimately, I want to be able to select the text for the year as well as the Alstar link out of these span statements for each div chunk, but I will cross that bridge when I can get the code to find all the instances of that class.
Alternatively, is there some easier way I can get the Registration year and the Alstar link (e.g. <a href = "alstar_label.aspx?LabelId=116671" target = 'blank'>2009005053</span> <img alt='ALSTAR Link' src='images/alstar.png'/></a>) rather than what I am trying to do?
I am using Python 3.7.2. Thank you!
I managed to get some data from this site. All you need to know is the company number; in the case of Monsanto, the number is 38 (this number is shown after selecting Maine and typing "monsanto" in the search box):
import re
import requests
from bs4 import BeautifulSoup
url_1 = 'http://npirspublic.ceris.purdue.edu/state/state_menu.aspx?state=ME'
url_2 = 'http://npirspublic.ceris.purdue.edu/state/company.aspx'
company_name = 'monsanto'
company_number = '38'
with requests.session() as s:
    r = s.get(url_1)
    soup = BeautifulSoup(r.text, 'lxml')
    data = {i['name']: '' for i in soup.select('input[name]')}
    for i in soup.select('input[value]'):
        data[i['name']] = i['value']
    data['ctl00$ContentPlaceHolder1$search'] = 'company'
    data['ctl00$ContentPlaceHolder1$TextBoxInput1'] = company_name
    r = s.post(url_1, data=data)
    soup = BeautifulSoup(r.text, 'lxml')
    data = {i['name']: '' for i in soup.select('input[name]')}
    for i in soup.select('input[value]'):
        data[i['name']] = i['value']
    data = {k: v for k, v in data.items() if not k.startswith('ctl00$ContentPlaceHolder1$')}
    data['ctl00$ContentPlaceHolder1${}'.format(company_number)] = 'Display+Products'
    r = s.post(url_2, data=data)
    soup = BeautifulSoup(r.text, 'lxml')
    for div in soup.select('.nopgbrk'):
        # extract name
        print(div.select_one('.ppisreportspanprodname').text)
        # extract ME product number
        s = ''.join(re.findall(r'\d{10}', div.text))
        print(s)
        # extract alstar link
        s = div.select_one('a[href*="alstar_label.aspx"]')
        if s:
            print(s['href'])
        else:
            print('No ALSTAR link')
        # extract Registration year
        s = div.find(text=lambda t: 'Registration Year:' in t)
        if s:
            print(s.next.text)
        else:
            print('No registration year.')
        print('-' * 80)
Prints:
PRECEPT INSECTICIDE
2014000996
No ALSTAR link
2019
--------------------------------------------------------------------------------
ACCELERON IC-609 INSECTICIDE SEED TREATMENT FOR CORN
2009005053
alstar_label.aspx?LabelId=117531
2019
--------------------------------------------------------------------------------
ACCELERON D-342 FUNGICIDE SEED TREATMENT
2015000498
alstar_label.aspx?LabelId=117538
2019
--------------------------------------------------------------------------------
ACCELERON DX-309
2009005026
alstar_label.aspx?LabelId=117559
2019
--------------------------------------------------------------------------------
... and so on.
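The pattern in this answer (collect every input's name and value, then post the form back) is the usual way to drive ASP.NET WebForms pages, because server-generated fields such as __VIEWSTATE and __EVENTVALIDATION must be echoed back on each post. If you reuse it, the two collection steps can live in one helper; a sketch (the helper name is mine, not from the answer):
from bs4 import BeautifulSoup
def collect_form_fields(soup):
    """Gather all <input name=...> fields, keeping any server-provided
    values (e.g. __VIEWSTATE) so the form can be posted back unchanged."""
    data = {i['name']: '' for i in soup.select('input[name]')}
    for i in soup.select('input[value]'):
        data[i['name']] = i['value']
    return data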

Get Text from h1 with BeautifulSoup

I was asked to get a product name from a web.
I was asked to get this text:
SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K
This is my BeautifulSoup code:
import requests
from bs4 import BeautifulSoup
get = requests.get('https://www.ebay.com/itm/SEIKO-5-AUTOMATIC-MENS-STEEL-VINTAGE-JAPAN-MADE-BLACK-DIAL-WATCH-RUN-ORDER-K/143420840058?epid=18032713872&_trkparms=ispr%3D1&hash=item21648c587a:g:ZzEAAOSw9MRdsI8v&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8')
soup = BeautifulSoup(get.text, 'lxml')
company = soup.select('h1.it-ttl')[0].text.strip()
print(company)
The HTML from the code is:
<h1 class="it-ttl" id="itemTitle" itemprop="name">
<span class="g-hdn">Details about
</span>
SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K
</h1>
Instead of the desired text, I get this:
Details about SEIKO 5 AUTOMATIC MENS STEEL VINTAGE JAPAN MADE BLACK DIAL WATCH RUN ORDER K
How can I extract only the product name?
import requests
from bs4 import BeautifulSoup
get = requests.get('https://www.ebay.com/itm/SEIKO-5-AUTOMATIC-MENS-STEEL-VINTAGE-JAPAN-MADE-BLACK-DIAL-WATCH-RUN-ORDER-K/143420840058?epid=18032713872&_trkparms=ispr%3D1&hash=item21648c587a:g:ZzEAAOSw9MRdsI8v&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8&enc=AQAEAAACQBPxNw%2BVj6nta7CKEs3N0qVBgKB1sCHq6imZgPqwOxGc8125XNy2Dq0slMe8clDZgTSnJdS4K5F5NyTF%2FwJExAng2G2%2FdtRUNYEnKcxoo4WXaAM5K%2BUxqDKTnmNGfgjTzpWCdoE50XlC7BXz3bBrJTY0vo62kBVR03HYvJwVCxnu8NEBiz4YMfAlPWDNnP2lVje46p22rKWDem6rHFqpoKtLDVHS8CaQER%2BqJxucEnw14LJIybRkfCmDuobZv%2F4F9Lhrl8xiPp%2Bbk6iRIu3UqqocBO%2FNyxW1aAa8QWkaJqtUy3g6Yue61yMEb0GY3BwO1%2BpVwkTOZLDvYHXZ%2FZEGNu%2F%2BYznes9jNtctDCr9Xv3QECsXyLDEOeo7LHh1srunEoRvK9T0AkS7oT%2BI3%2B%2BtD5fGnpJJu%2FJ3MdktqvgnTwieipeZTrGsHiQ8iL1nWm0CJcMbe2UUELEG%2BLHPNSSkRcUVBWnoPuOE5FjuyFHR1ujG2TgGLfN8HlO6ZyfNWz0K%2Bc4zjo7wBPnJdffcn6p8kLHWhbFyMyIY1Jc8yZBl20mlA29S%2BN%2Bw0e3uZDHK%2BIyCBctbYgGxaQM6Aevcdx0OcXl%2Fy7aDoRTqhBue9OYrAa3fEQf6ObFqtCbiEiXTioQZZJfrC%2FXfbq36oMTuQAFRvH2ahowGoPhSQkE1Jn73QLI%2FGXVynHIG2KdQSbX4eU%2FgoGy9y5WIvvUL9Xxy4ltNvTtCpjg5XlY8VxDv4M2gsLY3C0SRv7LNELk%2FitBSjfuUjzg%3D%3D&checksum=143420840058aa89790ec2164a5caf16644bb1bfd7c8')
soup = BeautifulSoup(get.text, 'html.parser')
company = soup.select('h1.it-ttl')[0].text.strip()
span_text = soup.select('span.g-hdn')[0].text.strip()
print(company)
print(span_text)
print(company.replace(span_text, '', 1).strip())
Since the span tag is nested in the h1 tag, the necessary step is to extract the span text and remove it from the h1 text. (str.replace is used rather than str.lstrip, because lstrip strips a set of characters, not a literal prefix, and could eat the start of the product name.)
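An alternative that avoids string surgery altogether is to remove the nested span before reading the heading; a sketch continuing from the soup object above:
h1 = soup.select_one('h1.it-ttl')
h1.span.extract()        # remove the "Details about" span in place
print(h1.text.strip())   # SEIKO 5 AUTOMATIC MENS STEEL VINTAGE ...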

Removing particular content from results parsed using BeautifulSoup

def get_description(link):
    redditFile = urllib2.urlopen(link)
    redditHtml = redditFile.read()
    redditFile.close()
    soup = BeautifulSoup(redditHtml)
    desc = soup.find('div', attrs={'class': 'op_gd14 FL'}).text
    return desc
This is the code which gives me the text from this HTML:
<div class="op_gd14 FL">
<p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br>
Read all announcements in Prestige Estate </p><p> </p>
</div>
This result is fine for me; I just want to exclude the content of
"Read all announcements in Prestige Estate"
from the result (desc in my script) if it is present, and ignore it if it is not. How can I do this?
You can use extract() to remove unnecessary tags from the find() result:
descItem = soup.find('div', attrs={'class': 'op_gd14 FL'}) # get the DIV
[s.extract() for s in descItem('a')] # remove <a> tags
return descItem.get_text() # return the text
Just make some changes to the last line and add the re module:
...
return re.sub(r'<a(.*)</a>','',desc)
Output:
'<div class="op_gd14 FL">\n <p><span class="bigT">P</span>restige Estates Projects Ltd has informed BSE that the 18th Annual General Meeting (AGM) of the Company will be held on September 30, 2015.Source : BSE<br><br> \n </p><p>
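If you want to drop just that one sentence rather than every link, you can target the specific tag by its text and decompose() it before reading the div's text; a sketch assuming, as the first answer does, that the sentence sits inside an <a> tag:
from bs4 import BeautifulSoup
def get_description_filtered(html):
    soup = BeautifulSoup(html, 'html.parser')
    div = soup.find('div', attrs={'class': 'op_gd14 FL'})
    # remove the unwanted link only if it is present
    for a in div.find_all('a'):
        if 'Read all announcements' in a.get_text():
            a.decompose()
    return div.get_text()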
