Deleting all content between brackets from a string using Python

I am using BeautifulSoup to grab data from an HTML page, and when I grab the data, I am left with this:
<tr>
<td class="main rank">1</td>
<td class="main company"><a href="/colleges/williams-college/">
<img alt="" src="http://i.forbesimg.com/media/lists/colleges/williams-college_50x50.jpg">
<h3>Williams College</h3></img></a></td>
<td class="main">Massachusetts</td>
<td class="main">$61,850</td>
<td class="main">2,124</td>
</tr>
This is the beautifulsoup command I am using to get this:
html = open('collegelist.html')
test = BeautifulSoup(html)
soup = test.find_all('tr')
I now want to manipulate this text so that it outputs:
1
Williams College
Massachusetts
$61,850
2,124
I am having difficulty doing so for the entire document, where I have about 700 of these entries. Any advice would be appreciated.

Just get the .text (or use get_text()) for every tr in the loop:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('collegelist.html'), 'html.parser')
for tr in soup.find_all('tr'):
    print(tr.text)  # or tr.get_text()
For the HTML you've provided it prints:
1
Williams College
Massachusetts
$61,850
2,124
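If you need each cell as a separate value rather than one block of text, a minimal sketch along the same lines is to collect tr.stripped_strings, which yields every text fragment in the row with surrounding whitespace removed. The inline HTML below is a trimmed copy of the row from the question (the img tag is left out), and html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the row from the question, wrapped in a table for clarity.
html = """<table><tr>
<td class="main rank">1</td>
<td class="main company"><a href="/colleges/williams-college/"><h3>Williams College</h3></a></td>
<td class="main">Massachusetts</td>
<td class="main">$61,850</td>
<td class="main">2,124</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")

# One list of cleaned strings per <tr>; whitespace-only fragments are skipped.
rows = [list(tr.stripped_strings) for tr in soup.find_all("tr")]

for row in rows:
    print("\n".join(row))
```

With roughly 700 rows, keeping a list per row makes it easy to write the result to CSV later instead of re-parsing printed text.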

Use get_text():
soup = BeautifulSoup(html)
"".join([x.get_text() for x in soup.find_all('tr')])

Related

Cannot scrape the table for href links

I am trying to pull the href links from the table, which I later need to click one by one to access the data behind each link, but I cannot figure out how to do it. I have tried find_all and keep getting a "ResultSet object has no attribute '%s'" error.
HTML (very long, so this is about a tenth of it):
<thead>
<tr class="sctablehead">
<th>Academic Program</th>
<th>Departments</th>
<th>Academic Level</th>
<th>College</th>
<th>Online</th>
<th>Degree Type</th>
</tr>
</thead>
<tbody>
<tr class="even firstrow"><td>Accountancy</td><td>Accounting</td><td>Graduate</td><td>BUS</td><td></td><td>MAC</td></tr>
<tr class="odd"><td>Accounting</td><td>Accounting</td><td>Undergraduate</td><td>BUS</td><td></td><td>BSB</td></tr>
<tr class="even"><td>Accounting</td><td>Accounting</td><td>Undergraduate</td><td>BUS</td><td></td><td>Minor</td></tr>
<tr class="odd"><td>Actuarial Science</td><td>Mathematics, Economics, Finance</td><td>Undergraduate</td><td>STEM</td><td></td><td>Minor</td></tr>
<tr class="even"><td>Adult Gerontology Acute Care Nurse Practitioner</td><td>Nursing</td><td>Graduate</td><td>HHS</td><td></td><td>PMC</td></tr>
<tr class="odd"><td>Advertising and Public Relations</td><td>Advertising</td><td>Undergraduate</td><td>BUS</td><td></td><td>BSB</td></tr>
<tr class="even"><td>Advertising Public Relations</td><td>Marketing</td><td>Undergraduate</td><td>BUS</td><td></td><td>Minor</td></tr>
<tr class="odd"><td>Aerospace Studies</td><td>Aerospace Studies</td><td>Undergraduate</td><td>HHS</td><td></td><td>Minor</td></tr>
<tr class="even"><td>Africana Studies</td><td>Africana Studies</td><td>Undergraduate</td><td>BCLASSE</td><td></td><td>Minor</td></tr>
... And so on
My Code:
r = requests.get(driver.current_url)
soup = bs(r.content, 'html.parser')
programs_table = soup.find_all('table', {"class":"sc_sctable tbl_degrees sorttable"})
for tr in programs_table.find_all('tr class'):
    for a in tr.find_all('a'):
        print(a['href'])
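Assuming the table itself is found correctly (you have not provided the HTML for it), two small changes are enough: use find() so that you get a single table rather than a ResultSet, and search for 'tr' instead of 'tr class':
r = requests.get(driver.current_url)
soup = bs(r.content, 'html.parser')
# find() returns one Tag; find_all() returns a ResultSet, which has no find_all()
programs_table = soup.find('table', {"class": "sc_sctable tbl_degrees sorttable"})
for tr in programs_table.find_all('tr'):
    for a in tr.find_all('a'):
        print(a['href'])
In other words, use programs_table.find_all("tr") instead of programs_table.find_all("tr class"), because find_all expects a tag name and "tr class" is not one.
With that change I get the following results: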
/undergraduate/colleges-programs/college-business-administration/school-accounting-finance/bsba-in-accounting/
/undergraduate/colleges-programs/college-business-administration/school-accounting-finance/accounting-minor/
/undergraduate/colleges-programs/college-science-technology-engineering-mathematics/department-mathematics-statistics/actuarial-science-minor/
/graduate/graduate-programs/post-masters-adult-gero-acute-care-nurse-pract-certificate-program/
/undergraduate/colleges-programs/college-business-administration/department-marketing/advertising-public-relations/
/undergraduate/colleges-programs/college-business-administration/department-marketing/advertising-public-relations-minor/
/undergraduate/colleges-programs/college-health-human-services/aerospace-studies-program/
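The "ResultSet object has no attribute" error from the question can be reproduced in a few lines. This is only a sketch on a made-up one-row table: find_all() returns a list-like ResultSet, so calling find_all() on it fails, while looping over the ResultSet (or using find()) works.

```python
from bs4 import BeautifulSoup

# Made-up table, just large enough to trigger the error.
html = "<table class='t'><tr><td><a href='/x/'>x</a></td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

tables = soup.find_all("table")  # a ResultSet (list of Tags), not a Tag

try:
    tables.find_all("tr")        # treating the list like a single element
    raised = False
except AttributeError as exc:
    raised = True
    print("AttributeError:", exc)

# Iterating over the ResultSet gives Tags, which do have find_all().
hrefs = [a["href"] for table in tables for a in table.find_all("a")]
print(hrefs)
```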
First of all, you shouldn't be using find_all to scrape a single tag unless you really want the result in a list. So, to get the table you should just do:
programs_table = soup.find('table', class_="sc_sctable")
Now, to get the href links, collect the <td> tags and keep only those that contain an inner <a> tag:
tags_with_href = programs_table.tbody.find_all('td')
links = [each_tag.a['href'] for each_tag in tags_with_href if each_tag.a]
# -> ['/graduate/graduate-programs/master-accountancy/', ... ]
If you want absolute URLs instead of relative ones, you can define base_url and prepend it to each relative URL:
base_url = '<base_url_of_website>'
links = [base_url + link for link in links]
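As an alternative to plain string concatenation, the standard library's urllib.parse.urljoin handles leading and trailing slashes for you. The sketch below assumes a hypothetical base URL (https://catalog.example.edu) and a two-row table whose first row carries the link path shown in the comment above; neither comes from the real site.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Hypothetical base URL and a cut-down table; only the link path is from the answer.
BASE_URL = "https://catalog.example.edu"
html = """<table class="sc_sctable tbl_degrees sorttable"><tbody>
<tr class="even firstrow"><td><a href="/graduate/graduate-programs/master-accountancy/">Accountancy</a></td><td>Accounting</td></tr>
<tr class="odd"><td>No link in this row</td><td>Accounting</td></tr>
</tbody></table>"""

soup = BeautifulSoup(html, "html.parser")
programs_table = soup.find("table", class_="sc_sctable")

# Select only <a> tags that actually have an href, then make them absolute.
links = [urljoin(BASE_URL, a["href"]) for a in programs_table.select("tbody a[href]")]
print(links)
```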

Get contents of a particular row

I want to locate 'td' where text is 'xyz' so that I can find other attributes in the row. I only have 'xyz' with me and want to get other elements in that row.
.
.
.
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>
.
.
.
I can get 'xyz' easily by using
required = soup.find('a', text = 'xyz')
print(required[0].text)
but I'm not able to locate 'td' so that I can use find_next_siblings() to get other columns.
Expected output:
xyz
address
phone number
With bs4 4.7.1 or later, combine the :has and :contains pseudo-classes to retrieve the row and the tds within it.
This part targets the right a tag, if present, by its text:
a:contains("xyz")
You then retrieve the parent row (tr) that has this a tag:
tr:has(a:contains("xyz"))
Finally, use a descendant combinator and a td type selector to get all the tds within that row, with a list comprehension to return their text as a list.
from bs4 import BeautifulSoup as bs
html = '''
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>
'''
soup = bs(html, 'lxml')
items = [item.text.strip() for item in soup.select('tr:has(a:contains("xyz")) td')]
print(items)
With a recent BeautifulSoup you can use the CSS selector :contains, then walk back up the tree with the find_parent() method.
from bs4 import BeautifulSoup
s = '''
<tr>
<td>Other1</td>
<td>Other1</td>
<td>Other1</td>
</tr>
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>'''
soup = BeautifulSoup(s, 'lxml')
for td in soup.select_one('a:contains(xyz)').find_parent('tr').select('td'):
    print(td.text.strip())
Prints:
xyz
address
phone number
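One caveat, as far as I know: newer versions of soupsieve (2.1 and later), the library behind select(), deprecate :contains in favour of :-soup-contains. On a current install the same lookup can be sketched as below. The inline HTML mirrors the example above, and html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup

html = """<table>
<tr><td>Other1</td><td>Other1</td><td>Other1</td></tr>
<tr><td><a>xyz</a></td><td>address</td><td>phone number</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# Pick the row that has an <a> containing "xyz", then all <td> cells inside it.
cells = [td.get_text(strip=True)
         for td in soup.select('tr:has(a:-soup-contains("xyz")) td')]
print(cells)
```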
Replace your code with this:
from bs4 import BeautifulSoup
html = '''<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>'''
soup = BeautifulSoup(html, 'lxml')
required = soup.find('a', text = 'xyz')
print(required.text)
td = required.parent
siblingsArray = td.find_next_siblings()
for sibling in siblingsArray:
    print(sibling.text)
Output:
xyz
address
phone number
Here, parent gives the immediate parent tag (the td), and find_next_siblings() returns a list of the sibling tags that follow it.
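The same idea can be wrapped in a small helper that also copes with the text not being found. This is a sketch only: row_cells is a made-up name, and it uses the string= argument, which is the newer spelling of text= (available since Beautiful Soup 4.4.0).

```python
from bs4 import BeautifulSoup


def row_cells(soup, link_text):
    """Return the text of every <td> in the row whose <a> text equals link_text."""
    a = soup.find("a", string=link_text)
    if a is None:
        return []                 # no matching link anywhere in the document
    tr = a.find_parent("tr")      # walk up to the enclosing row
    return [td.get_text(strip=True) for td in tr.find_all("td")]


html = """<table>
<tr><td><a>abc</a></td><td>other address</td><td>other phone</td></tr>
<tr><td><a>xyz</a></td><td>address</td><td>phone number</td></tr>
</table>"""
soup = BeautifulSoup(html, "html.parser")

found = row_cells(soup, "xyz")
missing = row_cells(soup, "not there")
print(found, missing)
```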
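If you are driving the page with Selenium rather than BeautifulSoup, you can also locate the row with XPath, for example via find_elements_by_xpath():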
https://www.softwaretestingmaterial.com/how-to-locate-element-by-xpath-locator/

Unable to grab a text from soup

I have an HTML as follows:
<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5(+1.33%)</span></td>
</tr>
</table>
I tried to extract 191.1 from a line containing td class="stoksPrice">191.1</td>.
soup = BeautifulSoup(html)
res = soup.find_all('stoksPrice')
print (res)
But the result is [].
How can I find it?
There seem to be two issues:
First, your usage of find_all is invalid. As written, you are searching for a tag named stoksPrice, which is wrong, as your tags are table, tr, td, div and span. You need to change that to:
>>> res = soup.find_all(class_='stoksPrice')
to search for tags with that class.
Second, your HTML is malformed. The line with stoksPrice is:
</td>
td class="stoksPrice">191.1</td>
It should have been:
</td>
<td class="stoksPrice">191.1</td>
(Note the < before td.)
Not sure if that was a copy error into Stack Overflow or the HTML is originally malformed but that is not going to be easy to parse ...
Since there are multiple tags having the same class, you can use CSS selectors to get an exact match.
html = '''<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
<td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5(+1.33%)</span></td>
</tr>
</table>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('td[class="stoksPrice"]').text)
# 191.1
Or, you could use lambda and find to get the same.
print(soup.find(lambda t: t.name == 'td' and t['class'] == ['stoksPrice']).text)
# 191.1
Note: BeautifulSoup converts multi-valued class attributes into lists. So, the classes of the two td tags look like ['stoksPrice'] and ['stoksPrice', 'realTimChange'].
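A short sketch of that difference, using a cut-down version of the table: searching with class_ matches any tag that has stoksPrice among its classes, while comparing the class list exactly singles out the price cell.

```python
from bs4 import BeautifulSoup

# Cut-down version of the table, with the missing "<" restored.
html = """<table><tr>
<td class="stoksPrice realTimChange"><div class="realTimChangeMod"></div></td>
<td class="stoksPrice">191.1</td>
</tr></table>"""
soup = BeautifulSoup(html, "html.parser")

# class_ matches if "stoksPrice" is one of the classes, so both cells come back.
both = soup.find_all("td", class_="stoksPrice")
class_lists = [td["class"] for td in both]

# Keep only the cell whose class list is exactly ["stoksPrice"].
exact = [td for td in both if td.get("class") == ["stoksPrice"]]
price = float(exact[0].get_text(strip=True))
print(class_lists, price)
```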
Here is one way to do it using findAll.
Because every other stoksPrice cell is empty, the only one that parses as a number is the one with the price.
Wrap the conversion in a try/except clause to check whether the text is a floating-point number.
If it is not, the loop moves on to the next cell; if it is, the value is printed.
res = soup.findAll("td", {"class": "stoksPrice"})
for r in res:
    try:
        t = float(r.text)
    except ValueError:
        continue  # not a number, try the next cell
    print(t)
Output:
191.1

exclude tags with beautifulsoup

I am trying to get the contents of an HTML table with BeautifulSoup.
When I get down to the cell level, I need only the values that are not inside a strike tag:
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
So in the case above I would like to return only $0.41. I am using data.get_text(), but I do not know how to filter out the $0.45.
Any ideas on how to do it?
The other solutions here will work too. Adding one more method: extract().
From the documentation:
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.
You can use it like this (added one more <td> tag to show how it can be used in a loop):
html = '''
<td>
<strike>
$0.45
</strike>
<br/>
$0.41
</td>
<td>
<strike>
$0.12
</strike>
<br/>
$0.14
</td>
'''
soup = BeautifulSoup(html, 'html.parser')
for td in soup.find_all('td'):
    td.strike.extract()
    print(td.text.strip())
Output:
$0.41
$0.14
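Note that td.strike is None when a cell has no <strike> tag, so td.strike.extract() would raise AttributeError on such a cell. A slightly more defensive sketch loops over td.find_all('strike') and uses decompose(), which removes the tag without returning it. The third cell below is an added assumption to show the no-strike case.

```python
from bs4 import BeautifulSoup

html = """<table><tr>
<td><strike>$0.45</strike><br/>$0.41</td>
<td><strike>$0.12</strike><br/>$0.14</td>
<td>$0.99</td>
</tr></table>"""
soup = BeautifulSoup(html, "html.parser")

prices = []
for td in soup.find_all("td"):
    # Remove every struck-through price; a cell without one is left untouched.
    for strike in td.find_all("strike"):
        strike.decompose()
    prices.append(td.get_text(strip=True))
print(prices)
```

Both extract() and decompose() modify the parsed tree, so run them on a copy if you still need the struck-through values afterwards.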
You can look at all NavigableString children of the TD tag and ignore all other elements:
import bs4
textData = ''.join(x for x in soup.find('td').children
                   if isinstance(x, bs4.element.NavigableString)).strip()
# '$0.41'
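An equivalent way to collect only the direct text children, without touching the tree, is find_all(string=True, recursive=False). The helper name own_text below is made up for this sketch, and the second cell is added to show it working in a loop.

```python
from bs4 import BeautifulSoup


def own_text(tag):
    """Join only the strings that are direct children of tag, ignoring nested tags."""
    return "".join(tag.find_all(string=True, recursive=False)).strip()


html = """<table><tr>
<td><strike>$0.45</strike><br/>
$0.41
</td>
<td><strike>$0.12</strike><br/>
$0.14
</td>
</tr></table>"""
soup = BeautifulSoup(html, "html.parser")

prices = [own_text(td) for td in soup.find_all("td")]
print(prices)
```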
You can do the same in several ways. Here is one such way:
from bs4 import BeautifulSoup
content="""
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
"""
soup = BeautifulSoup(content,"lxml")
item = soup.find("td").contents[-1].strip()
print(item)
Output:
$0.41
You can do this in the following way
from bs4 import BeautifulSoup
h = '''
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
'''
soup = BeautifulSoup(h, 'lxml')
a = soup.find('td').get_text()
print(a.split('\n')[2].strip())
This splits the text on newlines, takes the third piece, and strips the whitespace on both sides.

Extract data from an HTML tag which has specific attributes

<tr bgcolor="#FFFFFF">
<td class="tablecontent" scope="row" rowspan="1">
ZURICH AMERICAN INSURANCE COMPANY
</td>
<td class="tablecontent" scope="row" rowspan="1">
FARMERS GROUP INC (14523)
</td>
<td class="tablecontent" scope="row">
znaf
</td>
<td class="tablecontent" scope="row">
anhm
</td>
</tr>
I have an HTML document which contains multiple tr tags. I want to extract the href link from the first td and data from third td tag onwards under every tr tag. How can this be achieved?
You can find all tr elements, iterate over them, then do the context-specific searches for the inner td elements and get the first and the third:
for tr in soup.find_all('tr'):
    cells = tr.find_all('td')
    if len(cells) < 3:
        continue  # skip rows with fewer than three cells
    link = cells[0].a['href']  # assuming every first td has an "a" element
    data = cells[2].get_text()
    print(link, data)
As a side note, and depending on what you are trying to accomplish, pandas.read_html() is often a convenient way to parse HTML tables into DataFrames and work with those afterwards.
You can use the CSS selector nth-of-type to navigate through the tds.
Here's a sample:
soup = BeautifulSoup(html, 'html.parser')
a = soup.select('td:nth-of-type(1) a')[0]
href = a['href']
td = soup.select("td:nth-of-type(3)")[0]
text = td.get_text(strip=True)
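To cover every row and every cell from the third onwards, the two approaches can be combined as sketched below. The helper name extract_rows is made up, and the href value is a placeholder, since the HTML in the question shows no <a> tags.

```python
from bs4 import BeautifulSoup

# The href is a made-up placeholder; the cell texts are from the question.
html = """<table>
<tr bgcolor="#FFFFFF">
<td class="tablecontent"><a href="/company/1">ZURICH AMERICAN INSURANCE COMPANY</a></td>
<td class="tablecontent">FARMERS GROUP INC (14523)</td>
<td class="tablecontent">znaf</td>
<td class="tablecontent">anhm</td>
</tr>
<tr><td>too short</td></tr>
</table>"""


def extract_rows(soup):
    """Return (href of first cell or None, texts of third and later cells) per row."""
    rows = []
    for tr in soup.find_all("tr"):
        cells = tr.find_all("td")
        if len(cells) < 3:
            continue                      # not a data row
        a = cells[0].find("a", href=True)
        link = a["href"] if a else None   # tolerate a first cell without a link
        data = [c.get_text(strip=True) for c in cells[2:]]
        rows.append((link, data))
    return rows


rows = extract_rows(BeautifulSoup(html, "html.parser"))
print(rows)
```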
