Parsing webpage in Python using Beautiful Soup

This is the source of the web page that I am scraping with Beautiful Soup:
<tr>
<td>
1
</td>
<td style="cipher1">
<img class="cipher2" src="http://cipher3.png" alt="cipher4" title="cipher5" />
<span class="cipher8">t</span>cipher9
</td>
<td>
112
</td>
<td>
3510
</td>
// Pattern Repeated
<tr >
<td>
2
</td>
<td style="cipher1">
I wrote some code using BeautifulSoup, but I am getting more results than I want because the pattern occurs multiple times.
I have used:
row1 = soup.find_all('a', class_="cipher7")
for row in row1:
    f.write(row['title'] + "\n")
But this returns every 'cipher7' match, since that class occurs many times in the web page.
What I can use instead is this:
<td style="cipher1">...
since it is unique to the elements I want.
How do I modify my code to do this?

You can use a convenient select method which takes a CSS selector as an argument:
row = soup.select("td[style=cipher1] > a.cipher7")

You can first find the td tags (since you said the style is unique to them) and then find the specified a tags inside each one.
all_as = []
rows = soup.find_all('td', {'style': 'cipher1'})
for row in rows:
    # extend() keeps all_as a flat list of tags rather than a list of lists
    all_as.extend(row.find_all('a', class_="cipher7"))
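A minimal sketch comparing the two approaches above on a small document. The question's HTML does not show the a tags, so the markup below (including the title values) is invented for illustration only.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the <a class="cipher7"> tags and titles are assumptions.
html = """
<table>
<tr>
  <td style="cipher1"><a class="cipher7" title="wanted-1" href="#">x</a></td>
  <td><a class="cipher7" title="unwanted" href="#">y</a></td>
</tr>
<tr>
  <td style="cipher1"><a class="cipher7" title="wanted-2" href="#">z</a></td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# CSS selector: <a class="cipher7"> that is a direct child of the styled cell
by_selector = [a["title"] for a in soup.select('td[style="cipher1"] > a.cipher7')]

# Two-step search: find the styled cells first, then search inside each one
by_find_all = [
    a["title"]
    for td in soup.find_all("td", attrs={"style": "cipher1"})
    for a in td.find_all("a", class_="cipher7")
]

print(by_selector)
print(by_find_all)
```

Both lists contain only the titles from the styled cells. The link in the unstyled cell is skipped even though it carries the same class.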

Related

Python: Accessing a new <tr> while inside a different <tr> with BeautifulSoup4

I am trying to gather some data by webscraping a local HTML file using BeautifulSoup4. The problem is, that the information I'm trying to get is on different rows that have the same class tags. I'm not sure about how to access them. The following html screenshot contains the two rows I'm accessing with the data I need highlighted (sensitive info is scribbled out).
The code I have currently is:
def find_data(fileName):
    with open(fileName) as html_file:
        soup = bs(html_file, "lxml")
    hline1 = soup.find("td", class_="headerTableEntry")
    hline2 = hline1.find_next_sibling("td")
    hline3 = hline2.find_next_sibling("td")
    hline4 = hline3.find_next_sibling("td", class_="headerTableEntry")
    line1 = hline1.text
    line2 = hline2.text
    line3 = hline3.text
    #Nothing yet for lines 4,5,6
The first 3 lines work great and give 13, 39, and 33.3% as they should. But for line 4 (which should be in the second <tr> tag, the first <td> tag with class=headerTableEntry) I get the error "'NoneType' object is not callable".
My question: is there a different way to approach this so I can access all 6 data cells, or is there a way to change how I wrote line 4 so that it works? Thank you for your help, it is very much appreciated!

The second <tr> tag is not inside the first one: as you can see, the first <tr> tag is closed with </tr>. So the next <td> you want is not a sibling of the previous one, and find_next_sibling returns None. It sits within the next <tr> tag.
Pandas is a great package for parsing HTML <table> tags (which this is). Its read_html function uses lxml or BeautifulSoup under the hood. Just read the full table and slice out the columns you want:
html_file = '''<table>
<tr>
<td class="headerName">File:</td>
<td class="HeaderValue">Some Value</td>
<td></td>
<td class="headerName">Lines:</td>
<td class="headerTableEntry">13</td>
<td class="headerTableEntry">39</td>
<td class="headerTableEntry" style="back-ground-color:LightPink">33.3 %</td>
</tr>
<tr>
<td class="headerName">Date:</td>
<td class="HeaderValue">2020-06-18 11:15:19</td>
<td></td>
<td class="headerName">Branches:</td>
<td class="headerTableEntry">10</td>
<td class="headerTableEntry">12</td>
<td class="headerTableEntry" style="back-ground-color:#FFFF55">83.3 %</td>
</tr>
</table>'''
import pandas as pd
df = pd.read_html(html_file)[0]
df = df.iloc[:,3:]
So for your code:
def find_data(fileName):
    with open(fileName) as html_file:
        df = pd.read_html(html_file)[0].iloc[:, 3:]
    print(df)
Output:
          3   4   5       6
0    Lines:  13  39  33.3 %
1 Branches:  10  12  83.3 %
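If you would rather stay with BeautifulSoup, here is a minimal sketch of one way to collect all six cells. It uses a trimmed copy of the table above and the standard-library html.parser instead of lxml. The idea is to search inside each row with find_all rather than walking siblings across rows.

```python
from bs4 import BeautifulSoup

html = """<table>
<tr>
<td class="headerName">Lines:</td>
<td class="headerTableEntry">13</td>
<td class="headerTableEntry">39</td>
<td class="headerTableEntry">33.3 %</td>
</tr>
<tr>
<td class="headerName">Branches:</td>
<td class="headerTableEntry">10</td>
<td class="headerTableEntry">12</td>
<td class="headerTableEntry">83.3 %</td>
</tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

# find_next_sibling() stops at the end of a row, so search row by row instead.
rows = {}
for tr in soup.find_all("tr"):
    label = tr.find("td", class_="headerName").get_text(strip=True)
    rows[label] = [
        td.get_text(strip=True)
        for td in tr.find_all("td", class_="headerTableEntry")
    ]

print(rows)
```

Keying the result by the row label means the code does not depend on the rows appearing in a fixed order.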

Trying to print a single line of a table using BeautifulSoup, but the line location keeps changing

So I'm trying to get a table row to print out using BeautifulSoup, but I can't just use the ID of the row because the location of the row can change depending on a couple different variables. The rows all have names like trRow_1. What I need it to do is to print out the row that contains the text I'm looking for since it moves.
I cannot figure out the wording for me to print out the desired line using if statements.
This is what I've tried, which obviously doesn't work but should give you the idea of what I want:
table = soup1.find("table", id="tblActivities")
tablerow = table.findAll("tr")
TextIwant = tablerow.find(<span>"The Text I Want"</span>)
print(TextIWant)
Any idea of how to do this?
This is the row element I'm working with:
<tr id="trRow_5" class="changeTrOnhover" uniqueid="" rowid="2200005" action="0" postype="0" levelclass="2200005" riskcountry="United States" issuecurrency="" riskregion="" seq="5">
<!-- End positionDetail greater than 0 -->
<td>
<span class="bold"> Cash Equivalent
</span>
</td> <!-- Asset class desc -->
<td><span></span></td> <!-- price -->
<td><span></span></td> <!-- quantity -->
<!-- START PSI19 US77980 Populate values for Investment cost -->
<td class="bold"><span>
<span>52,896.91 USD
</span></span></td>
<!-- END PSI19 US77980 Populate values for Investment cost -->
<!-- base mkt -->
<td class="bold"><span>
52,896.91 USD
</span></td>
<!-- local mkt -->
<!-- perc of class -->
<td nowrap="">
<span class="bold">
6.88
</span>
</td>
<!-- perc of total mkt -->
<!-- income yield -->
<!-- moodys -->
<td><span></span></td> <!-- action -->
<!-- positionDetail = 0 -->
</tr>
soup.select_one('table#tblActivities').select('tr:has(td:contains("Cash Equivalent")) td')
This returns all of the table rows.
for td in table.select('tr:has(td:contains("Cash Equivalent")) td'):
    print(td.text.strip())
This returns all of the rows in the table as well.

I'm not sure what the problem is exactly. @Andrej Kesely's solution works for me. A simplified version of his solution also works:
soup = bs([your html above], 'html5')
for element in soup.select('tr:has(span:contains("Cash Equivalent"))'):
    print(element.text.replace('\n', '').strip())
And if you change tactics and replace CSS selection with find() methods:
span = soup.find(lambda tag: tag.name == "span" and "Cash Equivalent" in tag.text)
# walk up from the matching span to its enclosing row
row = span.find_parent('tr')
print(row.text.strip().replace('\n', ''))
That also works. In all these cases, the output is:
Cash Equivalent
52,896.91 USD
52,896.91 USD
6.88
which, I believe, is what you are looking for.

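You can use the CSS selectors :has() and :contains() to select the row with a td that contains the selected text (newer versions of soupsieve, the selector library behind select(), spell the latter :-soup-contains()):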
data = '''
<table id="tblActivities">
<tr>
<td>I Dont want this</td>
<td>I Dont want this</td>
<td>I Dont want this</td>
</tr>
<tr>
<td>Some data</td>
<td><span>The Text I Want</span></td>
<td>Some data</td>
</tr>
<tr>
<td>I Dont want this</td>
<td>I Dont want this</td>
<td>I Dont want this</td>
</tr>
</table>
'''
from bs4 import BeautifulSoup
soup = BeautifulSoup(data, 'lxml')
table = soup.select_one('table#tblActivities')
for td in table.select('tr:has(td:contains("The Text I Want")) td'):
    print(td.text)
Prints:
Some data
The Text I Want
Some data
Further reading:
CSS Selector reference
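For comparison, here is a minimal sketch that finds the row without CSS pseudo-classes, so it does not depend on which selector spelling the installed version supports. The helper name row_containing and the trRow_ ids in the sample markup are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

data = """
<table id="tblActivities">
<tr id="trRow_1"><td>I Dont want this</td><td>I Dont want this</td></tr>
<tr id="trRow_2"><td>Some data</td><td><span>The Text I Want</span></td><td>More data</td></tr>
<tr id="trRow_3"><td>I Dont want this</td><td>I Dont want this</td></tr>
</table>
"""

soup = BeautifulSoup(data, "html.parser")
table = soup.find("table", id="tblActivities")


def row_containing(table, text):
    """Hypothetical helper: first <tr> whose visible text contains `text`, or None."""
    for tr in table.find_all("tr"):
        if text in tr.get_text():
            return tr
    return None


row = row_containing(table, "The Text I Want")
cells = [td.get_text(strip=True) for td in row.find_all("td")]
print(row["id"], cells)
```

Because the match is made on the row's text rather than its id, the lookup keeps working when the row moves to a different position in the table.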

Unable to grab a text from soup

I have an HTML as follows:
<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5（+1.33%）</span></td>
</tr>
</table>
I tried to extract 191.1 from a line containing td class="stoksPrice">191.1</td>.
soup = BeautifulSoup(html)
res = soup.find_all('stoksPrice')
print (res)
But the result is [].
How can I find it?

There seem to be two issues:
First, your usage of find_all is wrong. As written, you are searching for a tag named stoksPrice, but your tags are table, tr, td, div, and span. You need to change that to:
>>> res = soup.find_all(class_='stoksPrice')
to search for tags with that class.
Second, your HTML is malformed. The line with stoksPrice is:
</td>
td class="stoksPrice">191.1</td>
It should have been:
</td>
<td class="stoksPrice">191.1</td>
(Note the < before the td.)
I'm not sure whether that was a copy error when posting to Stack Overflow or whether the HTML is malformed at the source, but as it stands it is not going to be easy to parse.

Since there are multiple tags having the same class, you can use CSS selectors to get an exact match.
html = '''<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
<td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5（+1.33%）</span></td>
</tr>
</table>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('td[class="stoksPrice"]').text)
# 191.1
Or, you could use a lambda with find to get the same result.
print(soup.find(lambda t: t.name == 'td' and t.get('class') == ['stoksPrice']).text)
# 191.1
Note: BeautifulSoup converts multi-valued class attributes into lists. So, the classes of the two td tags look like ['stoksPrice', 'realTimChange'] and ['stoksPrice'].
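A short sketch of why the exact match matters, using a trimmed copy of the table with the missing < restored. Searching with class_ matches any tag that has the class among its values, while comparing the whole class list keeps only the cell with the price.

```python
from bs4 import BeautifulSoup

html = """<table><tr>
<td class="stoksPrice realTimChange"><div class="realTimChangeMod"></div></td>
<td class="stoksPrice">191.1</td>
</tr></table>"""

soup = BeautifulSoup(html, "html.parser")

# class_ matches every tag that lists "stoksPrice" among its classes
loose = soup.find_all("td", class_="stoksPrice")

# comparing the full class list keeps only the tag whose class is exactly "stoksPrice"
exact = soup.find_all(lambda t: t.name == "td" and t.get("class") == ["stoksPrice"])

print(len(loose), len(exact), exact[0].get_text(strip=True))
```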

Here is one way to do it using findAll.
Because every other stoksPrice cell is empty, the only one left is the one with the price.
You can use a try/except clause to check whether the text is a floating-point number.
If it is not, the loop continues iterating; if it is, the value is printed.
res = soup.findAll("td", {"class": "stoksPrice"})
for r in res:
    try:
        t = float(r.text)
        print(t)
    except ValueError:
        # not a number (e.g. an empty cell), so skip it
        pass
191.1

Extract data from an HTML tag which has specific attributes

<tr bgcolor="#FFFFFF">
<td class="tablecontent" scope="row" rowspan="1">
ZURICH AMERICAN INSURANCE COMPANY
</td>
<td class="tablecontent" scope="row" rowspan="1">
FARMERS GROUP INC (14523)
</td>
<td class="tablecontent" scope="row">
znaf
</td>
<td class="tablecontent" scope="row">
anhm
</td>
</tr>
I have an HTML document which contains multiple tr tags. I want to extract the href link from the first td and data from third td tag onwards under every tr tag. How can this be achieved?

You can find all tr elements, iterate over them, then do context-specific searches for the inner td elements and get the first and the third:
for tr in soup.find_all('tr'):
    cells = tr.find_all('td')
    if len(cells) < 3:
        continue  # skip rows with fewer than three cells
    link = cells[0].a['href']  # assuming every first td has an "a" element
    data = cells[2].get_text()
    print(link, data)
As a side note, depending on what you are trying to accomplish, I usually find pandas.read_html() a convenient way to parse HTML tables into dataframes, which are easy data structures to work with afterwards.
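A minimal sketch of the row-by-row approach, extended to take the data from the third td onwards, as the question asks, and to tolerate a first cell without a link. The question's HTML lost its a tags, so the href value below is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the href value is an assumption, not from the original page.
html = """<table>
<tr bgcolor="#FFFFFF">
<td class="tablecontent"><a href="/company/1">ZURICH AMERICAN INSURANCE COMPANY</a></td>
<td class="tablecontent">FARMERS GROUP INC (14523)</td>
<td class="tablecontent">znaf</td>
<td class="tablecontent">anhm</td>
</tr>
<tr><td class="tablecontent">no link here</td></tr>
</table>"""

soup = BeautifulSoup(html, "html.parser")

records = []
for tr in soup.find_all("tr"):
    cells = tr.find_all("td")
    if len(cells) < 3:
        continue  # not a data row
    anchor = cells[0].find("a")
    link = anchor.get("href") if anchor else None  # tolerate a missing <a>
    values = [td.get_text(strip=True) for td in cells[2:]]  # third td onwards
    records.append((link, values))

print(records)
```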

You can use the CSS selector nth-of-type to navigate through the tds.
Here's a sample:
soup = BeautifulSoup(html, 'html.parser')
a = soup.select('td:nth-of-type(1) a')[0]
href = a['href']
td = soup.select("td:nth-of-type(3)")[0]
text = td.get_text(strip=True)

How to extract info from varying table entries: Text vs. DIV vs. SPAN

I am new to python and searched the internet to find an answer to my problem, but so far I failed...
The problem: My aim is to extract data from websites, more specifically from the tables in these websites. The relevant snippet of the website's source is stored in "data" in my Python example here:
from bs4 import BeautifulSoup
data = '''<table class="ds-table">
<tr>
<td class="data-label">year of birth:</td>
<td class="data-value">1994</td>
</tr>
<tr>
<td class="data-label">reporting period:</td>
<td class="data-value">
<span class="editable" id="c-scope_beginning_date">
? </span>
-
<span class="editable" id="c-scope_ending_date">
? </span>
</td>
</tr>
<tr>
<td class="data-label">reporting cycle:</td>
<td class="data-value">
<span class="editable" id="c-periodicity">
- </span>
</td>
</tr>
<tr>
<td class="data-label">grade:</td>
<td class="data-value">1.3, upper 10% of class</td>
</tr>
<tr>
<td class="data-label">status:</td>
<td class="data-value"></td>
</tr>
</table>
<table class="ds-table">
<tr>
<td class="data-label">economics:</td>
<td class="data-value"><span class="positive-value"></span></td>
</tr>
<tr>
<td class="data-label">statistics:</td>
<td class="data-value"><span class="negative-value"></span></td>
</tr>
<tr>
<td class="data-label">social:</td>
<td class="data-value"><div id="music_id" class="trigger"><span class="negative-value"></span></div></td>
</tr>
<tr>
<td class="data-label">misc:</td>
<td class="data-value">
<div id="c_assurance" class="">
<span class="positive-value"></span> </div>
</td>
</tr>
<tr>
<td class="data-label">recommendation:</td>
<td class="data-value">
<span class="negative-value"></span> </td>
</tr>
</table>'''
soup = BeautifulSoup(data)
For the class="data-label" so far I successfully implemented...
box_cdl = []
for i, cdl in enumerate(soup.findAll('td', attrs={'class': 'data-label'})):
    box_cdl.append(cdl.contents[0])
print box_cdl
...which extracts the text from the columns, in the (for me satisfying) output:
[u'year of birth:',
u'reporting period:',
u'reporting cycle:',
u'grade:',
u'status:',
u'economics:',
u'statistics:',
u'social:',
u'misc:',
u'recommendation:']
Where I get stuck is the part for class="data-value" with the div- and span-fields and that some of the relevant information is hidden in the span-class. Moreover, the amount of the tr-rows can change from website to website, e.g. "status" comes after "reporting cycle" (instead of "grade").
However, when I do...
box_cdv = []
for j, cdv in enumerate(soup.findAll('td', attrs={'class': 'data-value'})):
    box_cdv.append(cdv.contents[0])
print box_cdv
...I get the error:
Traceback (most recent call last):
File "<ipython-input-53-7d5c095cf647>", line 3, in <module>
box_cdv.append(cdv.contents[0])
IndexError: list index out of range
What I would like to get instead is something like this (corresponding to the above "data"-example):
[u'1994',
u'? - ?',
u'-',
u'1.3, upper 10% of class',
u'',
u'positive-value',
u'negative-value',
u'negative-value',
u'positive-value',
u'negative-value']
The Question: how can I extract this information and collect the relevant data from each tr-row, given that the adequate extraction-code depends on the type of the category (year of birth, reporting period, ..., recommendation)?
Or, asking differently: what code extracts me, depending on the category (year of birth, reporting period, ..., recommendation), the corresponding value (1994, ..., negative-value)?
Since the amount and the type of the table-entries can differ between websites, a simple "on the i-th entry do the following" procedure is not applicable. The thing I am looking for I think is something like "if you find the text "recommendation:", then extract the class-type from the span-field", I guess. But unfortunately I do not have any clue how to translate that into python-language.
Any help is highly appreciated.

You get that error because one of the tags doesn't have any children, so indexing its contents list raises an error.
You can approach this in the following way:
1) Search for the data-label tags;
2) Find the next td sibling;
3) Check if the sibling has text;
3a) If so, create a dict entry with the data-label as the key and the sibling text as its value;
3b) If not, check if the sibling's first child has a class containing -value;
4) Parse the data.
Example:
soup = BeautifulSoup(data, 'lxml')
result = {}
for tag in soup.find_all("td", {"class": "data-label"}):
    # text of the value cell that follows this label
    NextSibling = tag.find_next("td", {"class": "data-value"}).get_text(strip=True)
    # empty cell: fall back to the class name of an inner span, if there is one
    if not NextSibling and len(tag.find_next("td").select('span[class*="-value"]')) > 0:
        NextSibling = tag.find_next("td").select('span[class*="-value"]')[0]["class"][0]
    result[tag.get_text(strip=True)] = NextSibling
print(result)
Result:
{
'year of birth:': '1994',
'reporting period:': '?-?',
'reporting cycle:': '-',
'grade:': '1.3, upper 10% of class',
'status:': '',
'economics:': 'positive-value',
'statistics:': 'negative-value',
'social:': 'negative-value',
'misc:': 'positive-value',
'recommendation:': 'negative-value'
}
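The question also asks for something like "if you find the text recommendation:, then extract the class from the span". A minimal sketch of such a lookup by label follows, on a trimmed copy of the table. The helper name value_for is an assumption, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

data = """<table class="ds-table">
<tr><td class="data-label">year of birth:</td><td class="data-value">1994</td></tr>
<tr><td class="data-label">status:</td><td class="data-value"></td></tr>
<tr><td class="data-label">recommendation:</td>
<td class="data-value"><span class="negative-value"></span> </td></tr>
</table>"""

soup = BeautifulSoup(data, "html.parser")


def value_for(soup, label):
    """Hypothetical helper: value for `label`, or None if the label is absent.

    Returns the cell text if there is any, otherwise the class of an inner
    span whose class ends in "-value", otherwise an empty string.
    """
    for label_td in soup.find_all("td", class_="data-label"):
        if label_td.get_text(strip=True) != label:
            continue
        value_td = label_td.find_next_sibling("td", class_="data-value")
        text = value_td.get_text(" ", strip=True)
        if text:
            return text
        span = value_td.find("span", class_=lambda c: c and c.endswith("-value"))
        return span["class"][0] if span else ""
    return None


print(value_for(soup, "year of birth:"))
print(value_for(soup, "recommendation:"))
```

Looking values up by label rather than by position keeps the code independent of how many rows a given page has or the order they appear in.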
