I am trying to go through a website and extract some information using Chromedriver. The problem I have when I use BeautifulSoup is that I can't find a way to extract a table that sits inside an element with a given class.
The way I am trying to extract the information looks like this:
results = soup.find_all("div", class_="widget widgetLarge fpPerfglissanteclassique")
Is there a way to change this line so that it only returns the information in the <td>...</td> elements found inside that class?
Thanks for your answers in advance!
Your results variable is a ResultSet, which behaves like a list of Tag objects. You can iterate through it and call find and find_all on the individual result items.
Like this:
from bs4 import BeautifulSoup
html = """
<div class="widget widgetLarge fpPerfglissanteclassique">
<td>item 1</td>
<td>item 2</td>
<td>item 3</td>
</div>
<div class="widget widgetLarge fpPerfglissanteclassique">
<td>item 4</td>
<td>item 5</td>
<td>item 6</td>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
results = soup.find_all("div", class_="widget widgetLarge fpPerfglissanteclassique")
for result in results:
    table_results = result.find_all("td")
    print(table_results)
Result:
[<td>item 1</td>, <td>item 2</td>, <td>item 3</td>]
[<td>item 4</td>, <td>item 5</td>, <td>item 6</td>]
If the table is inside this class, here is an example of how to get the data from it:
from bs4 import BeautifulSoup
html = """
<div class="widget widgetLarge fpPerfglissanteclassique">
<table>
<tr>
<td>1</td><td>2</td><td>3</td>
</tr>
<tr>
<td>4</td><td>5</td><td>6</td>
</tr>
</table>
</div>
"""
soup = BeautifulSoup(html, "html.parser")
results = soup.find_all(
"div", class_="widget widgetLarge fpPerfglissanteclassique"
)
for result in results:                        # <-- iterate every result
    for row in result.find_all("tr"):         # <-- find all rows
        cell_data = []
        for cell in row.find_all("td"):       # <-- find all cells inside row
            cell_data.append(cell.text)
        print(*cell_data)
Prints:
1 2 3
4 5 6
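As a possible alternative to the nested loops, the cells can be requested in a single call with a CSS selector. This is only a sketch: the second div with class "other" is a made-up addition to show that cells outside the target class are skipped, and html.parser is assumed as the parser.

```python
from bs4 import BeautifulSoup

html = """
<div class="widget widgetLarge fpPerfglissanteclassique">
  <table>
    <tr><td>1</td><td>2</td><td>3</td></tr>
    <tr><td>4</td><td>5</td><td>6</td></tr>
  </table>
</div>
<div class="other">
  <table><tr><td>ignored</td></tr></table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# "div.a.b.c td" means: every td that is a descendant of a div
# carrying all three classes (the order of the classes does not matter).
cells = soup.select("div.widget.widgetLarge.fpPerfglissanteclassique td")
cell_texts = [cell.get_text(strip=True) for cell in cells]
print(cell_texts)
```

This returns one flat list of cells, so use the row-by-row loop instead when the row structure matters.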
I want to locate 'td' where text is 'xyz' so that I can find other attributes in the row. I only have 'xyz' with me and want to get other elements in that row.
.
.
.
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>
.
.
.
I can get 'xyz' easily by using
required = soup.find('a', text = 'xyz')
print(required[0].text)
but I'm not able to locate 'td' so that I can use find_next_siblings() to get other columns.
Expected output:
xyz
address
phone number
With bs4 4.7.1 or later, combine the :has and :contains pseudo-classes to retrieve the row and the tds within it. (Newer Soup Sieve releases deprecate :contains in favour of :-soup-contains.)
This bit targets the right a tag if present by its text
a:contains("xyz")
You then retrieve the parent row (tr) having this a tag
tr:has(a:contains("xyz"))
And finally use a descendant combinator and a td type selector to get all the tds within that row. A list comprehension returns their text as a list.
from bs4 import BeautifulSoup as bs
html = '''
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>
'''
soup = bs(html, 'lxml')
items = [item.text.strip() for item in soup.select('tr:has(a:contains("xyz")) td')]
print(items)
If you have modern BeautifulSoup, you can use CSS selector :contains. Then traverse back with find_parent() method.
from bs4 import BeautifulSoup
s = '''
<tr>
<td>Other1</td>
<td>Other1</td>
<td>Other1</td>
</tr>
<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>'''
soup = BeautifulSoup(s, 'lxml')
for td in soup.select_one('a:contains(xyz)').find_parent('tr').select('td'):
    print(td.text.strip())
Prints:
xyz
address
phone number
Replace your code with this:
from bs4 import BeautifulSoup
html = '''<tr>
<td>
<a>xyz</a>
</td>
<td>address</td>
<td>phone number</td>
</tr>'''
soup = BeautifulSoup(html, 'lxml')
required = soup.find('a', text = 'xyz')
print(required.text)
td = required.parent
siblingsArray = td.find_next_siblings()
for siblings in siblingsArray:
    print(siblings.text)
O/P:
xyz
address
phone number
Here .parent returns the immediate parent tag (the td), and find_next_siblings() returns a list of the sibling tags that follow it.
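The same lookup can be sketched without CSS pseudo-classes by combining find with find_parent. The helper name row_values is made up for illustration, the first table row is extra sample data, and the string= argument assumes bs4 4.4 or later.

```python
from bs4 import BeautifulSoup

html = """
<table>
  <tr><td>Other1</td><td>Other2</td><td>Other3</td></tr>
  <tr><td><a>xyz</a></td><td>address</td><td>phone number</td></tr>
</table>
"""


def row_values(soup, link_text):
    """Return the text of every td in the row whose <a> has exactly link_text."""
    link = soup.find("a", string=link_text)
    if link is None:
        return []  # no link with that text
    row = link.find_parent("tr")
    if row is None:
        return []  # the link is not inside a table row
    return [td.get_text(strip=True) for td in row.find_all("td")]


soup = BeautifulSoup(html, "html.parser")
print(row_values(soup, "xyz"))
```

Returning an empty list for a missing link avoids the AttributeError that chained calls raise when find returns None.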
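If you are driving the page with Selenium rather than BeautifulSoup, you can also locate the row with XPath via find_elements_by_xpath() (written find_elements(By.XPATH, ...) in Selenium 4):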
https://www.softwaretestingmaterial.com/how-to-locate-element-by-xpath-locator/
I have an HTML as follows:
<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5(+1.33%)</span></td>
</tr>
</table>
I tried to extract 191.1 from a line containing td class="stoksPrice">191.1</td>.
soup = BeautifulSoup(html)
res = soup.find_all('stoksPrice')
print (res)
But result is [].
How to find it guys?
There seem to be two issues:
First, your usage of find_all is invalid. As written, you are searching for a tag named stoksPrice, which is wrong, as your tags are table, tr, td, div, span. You need to change that to:
>>> res = soup.find_all(class_='stoksPrice')
to search for tags with that class.
Second, your HTML is malformed. The line with stoksPrice is:
</td>
td class="stoksPrice">191.1</td>
it should have been:
</td>
<td class="stoksPrice">191.1</td>
(Note the < before the td.)
Not sure if that was a copy error into Stack Overflow or the HTML is originally malformed but that is not going to be easy to parse ...
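To make the first point concrete, here is a small sketch comparing a tag-name search with a class search. It assumes the HTML has already been corrected, uses a trimmed-down copy of the table, and picks html.parser as the parser.

```python
from bs4 import BeautifulSoup

html = """
<table class="stocksTable">
  <tr>
    <td class="stoksPrice realTimChange"><div class="realTimChangeMod"></div></td>
    <td class="stoksPrice">191.1</td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# A bare string is treated as a tag name, and there is no <stoksPrice> tag.
by_tag_name = soup.find_all("stoksPrice")

# class_ matches any tag that has this class among its classes.
by_class = soup.find_all("td", class_="stoksPrice")
class_lists = [td["class"] for td in by_class]
texts = [td.get_text(strip=True) for td in by_class]

print(len(by_tag_name), class_lists, texts)
```

Both td tags match the class search, so the empty one still has to be filtered out afterwards.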
Since there are multiple tags having the same class, you can use CSS selectors to get an exact match.
from bs4 import BeautifulSoup
html = '''<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
<td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5(+1.33%)</span></td>
</tr>
</table>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('td[class="stoksPrice"]').text)
# 191.1
Or, you could use lambda and find to get the same.
print(soup.find(lambda t: t.name == 'td' and t['class'] == ['stoksPrice']).text)
# 191.1
Note: BeautifulSoup converts multi-valued class attributes into lists. So, the classes of the two td tags look like ['stoksPrice'] and ['stoksPrice', 'realTimChange'].
Here is one way to do it using findAll (the older alias of find_all).
Because the other stoksPrice cell is empty, the only one left with usable text is the one holding the price.
You can use a try/except clause to check whether the text is a floating-point number.
If it is not, the loop moves on to the next cell; if it is, the value is printed.
res = soup.findAll("td", {"class": "stoksPrice"})
for r in res:
    try:
        t = float(r.text)
        print(t)
    except ValueError:
        # the cell text is not a number, so skip it
        pass
# 191.1
I am trying to get the contents of an HTML table with BeautifulSoup.
When I get to the level of the cell, I need only the values that are not inside the strike tag:
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
So in the case above I would like to return only $0.41. I am using data.get_text(), but I do not know how to filter out the $0.45.
Any ideas on how to do it?
The other answers will all work. Here is one more method: extract()
From the documentation:
PageElement.extract() removes a tag or string from the tree. It returns the tag or string that was extracted.
You can use it like this (added one more <td> tag to show how it can be used in a loop):
html = '''
<td>
<strike>
$0.45
</strike>
<br/>
$0.41
</td>
<td>
<strike>
$0.12
</strike>
<br/>
$0.14
</td>
'''
soup = BeautifulSoup(html, 'html.parser')
for td in soup.find_all('td'):
    td.strike.extract()  # remove the <strike> tag from the tree
    print(td.text.strip())
Output:
$0.41
$0.14
You can look at all NavigableString children of the TD tag and ignore all other elements:
import bs4
textData = ''.join(
    x for x in soup.find('td').children
    if isinstance(x, bs4.element.NavigableString)
).strip()
# '$0.41'
You can do the same in several ways. Here is one such way:
from bs4 import BeautifulSoup
content="""
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
"""
soup = BeautifulSoup(content,"lxml")
item = soup.find("td").contents[-1].strip()
print(item)
Output:
$0.41
You can do this in the following way:
from bs4 import BeautifulSoup
h = '''
<td>
<strike>$0.45</strike><br/>
$0.41
</td>
'''
soup = BeautifulSoup(h, 'lxml')
a = soup.find('td').get_text()
print(a.split('\n')[2].strip())
Split the text on newlines, take the line you need, and strip the surrounding whitespace.
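If the tree should stay untouched (extract() removes the strike tag permanently), one possible sketch is to collect only the strings that are direct children of the cell. The helper name own_text is made up for illustration, and the string= argument assumes bs4 4.4 or later.

```python
from bs4 import BeautifulSoup

html = """
<table><tr>
<td>
  <strike>$0.45</strike><br/>
  $0.41
</td>
</tr></table>
"""


def own_text(tag):
    """Join the strings that are direct children of tag, ignoring nested tags."""
    direct_strings = tag.find_all(string=True, recursive=False)
    return "".join(direct_strings).strip()


soup = BeautifulSoup(html, "html.parser")
td = soup.find("td")
price = own_text(td)
print(price)
```

Unlike the positional approaches (contents[-1] or split('\n')[2]), this does not depend on where the strike tag sits inside the cell.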
html = '''
<div class="container">
<h2>Countries & Capitals</h2>
<table class="two-column td-red">
<thead><tr><th>Country</th><th>Capital city</th></tr></thead><tbody>
<tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
<tr><td>Albania</td><td>Tirana</td></tr>
</tbody>
</table>
</div>
'''
Given this HTML, I would like to specifically parse the country name and the capital city name and put them into a dictionary so that I can get
dict["Afghanistan"] = 'Kabul'
I've started by doing
soup = BeautifulSoup(open(filename), 'lxml')
countries = {}
# YOUR CODE HERE
table = soup.find_all('table')
for each in table:
    if each.find('tr'):
        continue
    else:
        print(each.prettify())
return countries
But it's confusing, since this is my first time using it.
You can select the "tr" elements; any row that has exactly two "td" children holds your data:
from bs4 import BeautifulSoup
html = """
<div class="container">
<h2>Countries & Capitals</h2>
<table class="two-column td-red">
<thead><tr><th>Country</th><th>Capital city</th></tr></thead><tbody>
<tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
<tr><td>Albania</td><td>Tirana</td></tr>
</tbody>
</table>
</div>
"""
soup = BeautifulSoup(html, 'lxml')
countries = {}
trs = soup.find_all('tr')
for tr in trs:
    tds = tr.find_all("td")
    if len(tds) == 2:
        countries[tds[0].text] = tds[1].text
print(countries)
Outputs:
{'Afghanistan': 'Kabul', 'Albania': 'Tirana'}
The solution is for the given html example:
from bs4 import BeautifulSoup # assuming you did pip install bs4
soup = BeautifulSoup(html, "html.parser") # the html you mentioned
table_data = soup.find('table')
data = {} # {'country': 'capital'} dict
for row in table_data.find_all('tr'):
    row_data = row.find_all('td')
    if row_data:  # the header row has only th cells, so it is skipped
        data[row_data[0].text] = row_data[1].text
I've skipped the try/except block for erroneous cases. I suggest going through the BeautifulSoup documentation; it covers everything.
How about this:
from bs4 import BeautifulSoup
element ='''
<div class="container">
<h2>Countries & Capitals</h2>
<table class="two-column td-red">
<thead>
<tr><th>Country</th><th>Capital city</th></tr>
</thead>
<tbody>
<tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
<tr><td>Albania</td><td>Tirana</td></tr>
</tbody>
</table>
</div>
'''
soup = BeautifulSoup(element, 'lxml')
countries = {}
for data in soup.select("tr"):
    elem = [item.text for item in data.select("th,td")]
    countries[elem[0]] = elem[1]
print(countries)
Output:
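{'Country': 'Capital city', 'Afghanistan': 'Kabul', 'Albania': 'Tirana'}
The header row ends up in the dictionary as well, because th cells are selected along with td cells. The order shown is what Python 3.7+ prints; older versions do not guarantee dictionary order.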
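One way to keep the header row out of the result is to select only the rows inside tbody. This is a sketch that assumes the table always has an explicit tbody, as in the HTML from the question, and that html.parser is used (it does not add or remove tbody elements).

```python
from bs4 import BeautifulSoup

html = """
<div class="container">
<table class="two-column td-red">
<thead><tr><th>Country</th><th>Capital city</th></tr></thead><tbody>
<tr class="grey"><td>Afghanistan</td><td>Kabul</td></tr>
<tr><td>Albania</td><td>Tirana</td></tr>
</tbody>
</table>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

countries = {}
# Only rows under <tbody> are visited, so the <thead> header row is skipped.
for row in soup.select("table.two-column tbody tr"):
    cells = [td.get_text(strip=True) for td in row.find_all("td")]
    if len(cells) == 2:  # ignore rows that are not country/capital pairs
        country, capital = cells
        countries[country] = capital

print(countries)
```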
I am using beautiful soup to grab data from an html page, and when I grab the data, I am left with this:
<tr>
<td class="main rank">1</td>
<td class="main company"><a href="/colleges/williams-college/">
<img alt="" src="http://i.forbesimg.com/media/lists/colleges/williams-college_50x50.jpg">
<h3>Williams College</h3></img></a></td>
<td class="main">Massachusetts</td>
<td class="main">$61,850</td>
<td class="main">2,124</td>
</tr>
This is the beautifulsoup command I am using to get this:
html = open('collegelist.html')
test = BeautifulSoup(html)
soup = test.find_all('tr')
I now want to manipulate this text so that it outputs
1
Williams College
Massachusetts
$61,850
2,124
and I am having difficulty doing so for the entire document, where I have about 700 of these entries. Any advice would be appreciated.
Just get the .text (or use get_text()) for every tr in the loop:
soup = BeautifulSoup(open('collegelist.html'), 'html.parser')
for tr in soup.find_all('tr'):
    print(tr.text)  # or tr.get_text()
For the HTML you've provided it prints:
1
Williams College
Massachusetts
$61,850
2,124
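With around 700 rows it may be more convenient to get each row as a list of fields instead of one block of text. The sketch below uses stripped_strings for that; it embeds only the single row from the question (wrapped in a table element, with the stray closing img tag dropped) rather than reading collegelist.html, and html.parser is an assumed parser choice.

```python
from bs4 import BeautifulSoup

html = """
<table>
<tr>
<td class="main rank">1</td>
<td class="main company"><a href="/colleges/williams-college/">
<img alt="" src="http://i.forbesimg.com/media/lists/colleges/williams-college_50x50.jpg">
<h3>Williams College</h3></a></td>
<td class="main">Massachusetts</td>
<td class="main">$61,850</td>
<td class="main">2,124</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# stripped_strings yields each text node with surrounding whitespace removed
# and skips nodes that contain only whitespace.
rows = [list(tr.stripped_strings) for tr in soup.find_all("tr")]

for fields in rows:
    print("\n".join(fields))
```

Each entry in rows is a plain list of strings, which can then be passed to csv.writer or similar if the data needs to be saved.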
Use get_text():
soup = BeautifulSoup(html, 'html.parser')
print("".join(x.get_text() for x in soup.find_all('tr')))