Parsing HTML table with LXML in Python - python

I need to parse html table of the following structure:
<table class="table1" width="620" cellspacing="0" cellpadding="0" border="0">
<tbody>
<tr width="620">
<th width="620">Smth1</th>
...
</tr>
<tr bgcolor="ffffff" width="620">
<td width="620">Smth2</td>
...
</tr>
<tr bgcolor="E4E4E4" width="620">
<td width="620">Smth3</td>
...
</tr>
<tr bgcolor="ffffff" width="620">
<td width="620">Smth4</td>
...
</tr>
</tbody>
</table>
Python code:
r = requests.post(url,data)
html = lxml.html.document_fromstring(r.text)
rows = html.xpath(xpath1)[0].findall("tr")
#Getting Xpath with FireBug
data = list()
for row in rows:
data.append([c.text for c in row.getchildren()])
But I get this on the third line:
IndexError: list index out of range
The task is to form python dict from this. Number of rows could be different.
UPD.
Changed the way I'm getting html code to avoid possible problems with requests lib. Now it's a simple url:
html = lxml.html.parse(test_url)
This proves everyting is Ok with html:
lxml.html.open_in_browser(html)
But still the same problem:
rows = html.xpath(xpath1)[0].findall('tr')
data = list()
for row in rows:
data.append([c.text for c in row.getchildren()])
Here is the xpath1:
'/html/body/table/tbody/tr[5]/td/table/tbody/tr/td[2]/table/tbody/tr/td/center/table'
UPD2. It was found experimentally, that xpath crashes on:
xpath1 = '/html/body/table/tbody'
print html.xpath(xpath1)
#print returns []
If xpath1 is shorter, then it seeem to work well and returns [<Element table at 0x2cbadb0>] for xpath1 = '/html/body/table'

You didn't include the XPath, so I'm not sure what you're trying to do, but if I understood correctly, this should work
xpath1 = "tbody/tr"
r = requests.post(url,data)
html = lxml.html.fromstring(r.text)
rows = html.xpath(xpath1)
data = list()
for row in rows:
data.append([c.text for c in row.getchildren()])
This is making a list of one item lists though, like this:
[['Smth1'], ['Smth2'], ['Smth3'], ['Smth4']]
To have a simple list of the values, you can use this code
xpath1 = "tbody/tr/*/text()"
r = requests.post(url,data)
html = lxml.html.fromstring(r.text)
data = html.xpath(xpath1)
This is all assuming that r.text is exactly what you posted up there.

Your .xpath(xpath1) XPath expression failed to find any elements. Check that expression for errors.

Related

Python: Accessing a new <tr> while inside a different <tr> with BeautifulSoup4

I am trying to gather some data by webscraping a local HTML file using BeautifulSoup4. The problem is, that the information I'm trying to get is on different rows that have the same class tags. I'm not sure about how to access them. The following html screenshot contains the two rows I'm accessing with the data I need highlighted (sensitive info is scribbled out).
The code I have currently is:
def find_data(fileName):
with open(fileName) as html_file:
soup = bs(html_file, "lxml")
hline1 = soup.find("td", class_="headerTableEntry")
hline2 = hline1.find_next_sibling("td")
hline3 = hline2.find_next_sibling("td")
hline4 = hline3.find_next_sibling("td", class_="headerTableEntry")
line1 = hline1.text
line2 = hline2.text
line3 = hline3.text
#Nothing yet for lines 4,5,6
The first 3 lines work great and give 13, 39, and 33.3% as they should. But for line 4 (which should be the second tag and first tag with class=headerTableEntry) I get an error "'NoneType' object is not callable".
My question is, is there a different way to go at this so I can access all 6 data cells or is there a way to edit how I wrote line 4 to work? Thank you for your help, it is very much appreciated!
The <tr> tag is not inside another <tr> tag as you can see that first <tr> tag is closed with the </tr> So that next <td> is not a sibling of the previous, hence it returns None. It's within the next <tr> tag.
Pandas is a great package to parse html <table> tags (which this is). It actually uses beautifulsoup under the hood. Just get the full table, and slice the table for the columns you want:
html_file = '''<table>
<tr>
<td class="headerName">File:</td>
<td class="HeaderValue">Some Value</td>
<td></td>
<td class="headerName">Lines:</td>
<td class="headerTableEntry">13</td>
<td class="headerTableEntry">39</td>
<td class="headerTableEntry" style="back-ground-color:LightPink">33.3 %</td>
</tr>
<tr>
<td class="headerName">Date:</td>
<td class="HeaderValue">2020-06-18 11:15:19</td>
<td></td>
<td class="headerName">Branches:</td>
<td class="headerTableEntry">10</td>
<td class="headerTableEntry">12</td>
<td class="headerTableEntry" style="back-ground-color:#FFFF55">83.3 %</td>
</tr>
</table>'''
import pandas as pd
df = pd.read_html(html_file)[0]
df = df.iloc[:,3:]
So for your code:
def find_data(fileName):
with open(fileName) as html_file:
df = pd.read_html(html_file)[0].iloc[:,3:]
print (df)
Output:
print (df)
3 4 5 6
0 Lines: 13 39 33.3 %
1 Branches: 10 12 83.3 %

Parse info from tables with Scrapy and XPath

I am trying to extract attributes from a website with scrapy and xpath:
response.xpath('//section[#id="attributes"]/div/table/tbody/tr/td/text()').extract()
The attributes are nested in the following way:
<section id="attributes">
<h5>Attributes</h5>
<div>
<table>
<tbody>
<tr>
<td>Attribute 1</td>
<td>Value 1</td>
</tr>
<tr>
<td>Attriburte 2</td>
<td>Value 2</td>
</tr>
There are two problems associated with this:
Get the content of the td elements (the XPath command will return[])
Once the td is retrieved, I need to get the pairing somehow. e.g.: "Attribute 1" = "Value 1"
I am new to phyton and scrapy, any help is greatly appreciated.
First of all you should try to remove tbody tag from XPath as usually it's not in page source.
You can update your code as below:
cells = response.xpath('//section[#id="attributes"]/div/table//tr/td/text()').extract()
att_values = [{first: second} for first, second in zip(cells[::2], cells[1::2])]
You will get list of attribute-value pairs:
[{attr_1: value_1}, {attr_2: value_2}, {attr_3: value_3}, ...]
or
att_values = {first: second for first, second in zip(cells[::2], cells[1::2])}
# or:
# att_values = dict( zip(cells[::2], cells[1::2]) )
to get dictionary
{attr_1: value_1, attr_2: value_2, attr_3: value_3, ...}
Try:
for row in response.css('section#attributes table tr'):
td1 = row.xpath('.//td[1]/text()').get()
td2 = row.xpath('.//td[2]/text()').get()
# your logic further

How to solve an attribute error when using beautifulsoup?

I am reading in a .html file that looks similar to the following format:
html = '''
<tr>
<td class="SmallFormText" colspan="3">hours per response:</td><td class="SmallFormTextR">23.8</td>
</tr>
<hr>
<table width="100%" border="0" cellspacing="0" cellpadding="4" summary="Form 13F-NT Header Information">
<tbody>
<tr>
<td class="FormTextC">COLUMN 1</td><td class="FormTextC">COLUMN 2</td><td class="FormTextC">COLUMN 3</td><td class="FormTextR">COLUMN 4</td><td class="FormTextC" colspan="3">COLUMN 5</td><td class="FormTextC">COLUMN 6</td><td class="FormTextR">COLUMN 7</td><td class="FormTextC" colspan="3">COLUMN 8</td>
</tr>
<tr>
<td class="FormText"></td><td class="FormText"></td><td class="FormText"></td><td class="FormTextR">VALUE</td><td class="FormTextR">SHRS OR</td><td class="FormText">SH/</td><td class="FormText">PUT/</td><td class="FormText">INVESTMENT</td><td class="FormTextR">OTHER</td><td class="FormTextC" colspan="3">VOTING AUTHORITY</td>
</tr>
<tr>
<td class="FormText">NAME OF ISSUER</td><td class="FormText">TITLE OF CLASS</td><td class="FormText">CUSIP</td><td class="FormTextR">(x$1000)</td><td class="FormTextR">PRN AMT</td><td class="FormText">PRN</td><td class="FormText">CALL</td><td class="FormText">DISCRETION</td><td class="FormTextR">MANAGER</td><td class="FormTextR">SOLE</td><td class="FormTextR">SHARED</td><td class="FormTextR">NONE</td>
</tr>
<tr>
<td class="FormData">1ST SOURCE CORP</td><td class="FormData">COM</td><td class="FormData">336901103</td><td class="FormDataR">8</td><td class="FormDataR">335</td><td class="FormData">SH</td><td> </td><td class="FormData">SOLE</td><td class="FormData">7</td><td class="FormDataR">335</td><td class="FormDataR">0</td><td class="FormDataR">0</td>
</tr>
<tr>
<td class="FormData">1ST UNITED BANCORP INC FLA</td><td class="FormData">COM</td><td class="FormData">33740N105</td><td class="FormDataR">7</td><td class="FormDataR">989</td><td class="FormData">SH</td><td> </td><td class="FormData">SOLE</td><td class="FormData">7</td><td class="FormDataR">989</td><td class="FormDataR">0</td><td class="FormDataR">0</td>
</tr> '''
In this code, I am trying to extract the information between the < tr > and < /tr > tags. In particular, I want to assign a given information, such as "NAME OF ISSUER" to a column name called "NAME_OF_ISSUER", using beautiful soup. However, when I run the following code, I am facing an error that looks simple to be solved (it's more or less a data formatting issue). Given that I am new to Python, I got stuck for a few hours trying alternative solutions. I would appreciate any comments or feedback.
Here is my code (please run the above code as well to obtain the html data):
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
rows = soup.find_all('tr')[11:]
positions = []
dic = {}
position = rows.find_all('td')
dic["NAME_OF_ISSUER"] = position[0].text
dic["CUSIP"] = position[2].text
dic["VALUE"] = int(position[3].text.replace(',', ''))*1000
dic["SHARES"] = int(position[4].text.replace(',', ''))
positions.append(dic)
df = pd.DataFrame(positions)
I am getting an "AttributeError" right after defining position, stating that the list object has no attribute "find_all".
What exactly does this mean? Also, how would I need to transform the html data to avoid this issue?
Edited part:
Here is the full stack trace:
position = rows.find_all('td')
Traceback (most recent call last):
File "<ipython-input-8-37353b5ab2ef>", line 1, in <module>
position = rows.find_all('td')
AttributeError: 'list' object has no attribute 'find_all'
soup.find_all returns a python list of elements. All you need to do is iterate through the list and grab data from those elements.
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'lxml')
rows = soup.find_all('tr')
# scan for header row and trim list
for index, row in enumerate(rows):
cells = row.find_all('td')
if cells and "NAME OF ISSUER" in cells[0].text.upper():
del rows[:index+1]
break
# convert remaining html rows to dict to create dataframe
positions = []
for position in rows:
dic = {}
cells = position.find_all('td')
dic["NAME_OF_ISSUER"] = cells[0].text
dic["CUSIP"] = cells[2].text
dic["VALUE"] = int(cells[3].text.replace(',', ''))*1000
dic["SHARES"] = int(celss[4].text.replace(',', ''))
positions.append(dic)
df = pd.DataFrame(positions)

Unable to grab a text from soup

I have an HTML as follows:
<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5(+1.33%)</span></td>
</tr>
</table>
I tried to extract 191.1 from a line containing td class="stoksPrice">191.1</td>.
soup = BeautifulSoup(html)
res = soup.find_all('stoksPrice')
print (res)
But result is [].
How to find it guys?
There seem to be two issues:
First is that your usage of find_all is invalid. The current way you're searching for a tagname called stoksPrice which is wrong ad your tags are table, tr, td, div, span. You need to change that to:
>>> res = soup.find_all(class_='stoksPrice')
to search for tags with that class.
Second, your HTML is malformed. The list with stoksPrice is:
</td>
td class="stoksPrice">191.1</td>
it should have been:
</td>
<td class)="stoksPrice">191.1</td>
(Note that < before the td)
Not sure if that was a copy error into Stack Overflow or the HTML is originally malformed but that is not going to be easy to parse ...
Since there are multiple tags having the same class, you can use CSS selectors to get an exact match.
html = '''<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
<td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5(+1.33%)</span></td>
</tr>
</table>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('td[class="stoksPrice"]').text)
# 191.1
Or, you could use lambda and find to get the same.
print(soup.find(lambda t: t.name == 'td' and t['class'] == ['stoksPrice']).text)
# 191.1
Note: BeautifulSoup converts multi-valued class attributes in lists. So, the classes of the two td tags look like ['stoksPrice'] and ['stoksPrice', 'realTimChange'].
Here is one way to do it using findAll.
Because all the previous stoksPrice are empty the only one that remains is the one with the price..
You can put in a check using try/except clause to check if it is a floating point number.
If it is not it will continue iterating and if it is it will return it.
res = soup.findAll("td", {"class": "stoksPrice"})
for r in res:
try:
t = float(r.text)
print(t)
except:
pass
191.1

Parsing webpage in Python using Beautiful Soup

This is the Web page Source code which I am scraping using Beautiful Soup.
<tr>
<td>
1
</td>
<td style="cipher1">
<img class="cipher2" src="http://cipher3.png" alt="cipher4" title="cipher5" />
<span class="cipher8">t</span>cipher9
</td>
<td>
112
</td>
<td>
3510
</td>
// Pattern Repeated
<tr >
<td>
2
</td>
<td style="cipher1">
I wrote some code using BeautifulSoup but I am getting more results than I want due to multiple occurrences of the pattern.
I have used
row1 = soup.find_all('a' ,class_ = "cipher7" )
for row in row1:
f.write( row['title'] + "\n")
But with this I get multiple occurences for 'cipher7' since it is occurring multiple times in the web page.
So the thing I can use this
<td style="cipher1">...
since it is unique to the things which I want.
So, How to modify my code to do this?
You can use a convenient select method which takes a CSS selector as an argument:
row = soup.select("td[style=cipher1] > a.cipher7")
You can first find the td tag (since you said it is unique) and then find the specified atag from it.
all_as = []
rows = soup.find_all('td', {'style':'cipher1'})
for row in rows:
all_as.append(row.find_all('a', class_ = "cipher7"))

Categories