I am trying to extract attributes from a website with scrapy and xpath:
response.xpath('//section[@id="attributes"]/div/table/tbody/tr/td/text()').extract()
The attributes are nested in the following way:
<section id="attributes">
  <h5>Attributes</h5>
  <div>
    <table>
      <tbody>
        <tr>
          <td>Attribute 1</td>
          <td>Value 1</td>
        </tr>
        <tr>
          <td>Attribute 2</td>
          <td>Value 2</td>
        </tr>
      </tbody>
    </table>
  </div>
</section>
There are two problems associated with this:
Getting the content of the td elements (the XPath expression above returns []).
Once the td texts are retrieved, I need to get the pairing somehow, e.g.: "Attribute 1" = "Value 1".
I am new to Python and Scrapy; any help is greatly appreciated.
First of all, you should remove the tbody tag from your XPath: browsers insert it when rendering, so it is usually not present in the actual page source.
You can update your code as below:
cells = response.xpath('//section[@id="attributes"]/div/table//tr/td/text()').extract()
att_values = [{first: second} for first, second in zip(cells[::2], cells[1::2])]
You will get a list of attribute-value pairs:
[{attr_1: value_1}, {attr_2: value_2}, {attr_3: value_3}, ...]
or
att_values = {first: second for first, second in zip(cells[::2], cells[1::2])}
# or:
# att_values = dict( zip(cells[::2], cells[1::2]) )
to get a single dictionary:
{attr_1: value_1, attr_2: value_2, attr_3: value_3, ...}
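The even/odd slicing trick is easy to check on plain lists, independent of Scrapy; the cell strings below are made up, standing in for what extract() would return:

```python
# Illustrative only: hand-made cell texts in document order.
cells = ["Attribute 1", "Value 1", "Attribute 2", "Value 2"]

# Even-indexed cells are names, odd-indexed cells are values.
att_values = dict(zip(cells[::2], cells[1::2]))
print(att_values)  # {'Attribute 1': 'Value 1', 'Attribute 2': 'Value 2'}
```

Note that zip stops at the shorter input, so a stray unpaired cell at the end would be silently dropped rather than raising an error.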
Try:
for row in response.css('section#attributes table tr'):
    td1 = row.xpath('.//td[1]/text()').get()
    td2 = row.xpath('.//td[2]/text()').get()
    # your further logic
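If you want to try the row-wise pairing idea outside a running Scrapy project, the standard library's html.parser is enough to sketch it; the markup fed in below is a trimmed, hypothetical version of the question's table:

```python
from html.parser import HTMLParser

class RowCollector(HTMLParser):
    """Collect the text of each <td>, grouped by table row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self.in_td = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])   # start a new row
        elif tag == "td":
            self.in_td = True

    def handle_endtag(self, tag):
        if tag == "td":
            self.in_td = False

    def handle_data(self, data):
        if self.in_td:
            self.rows[-1].append(data.strip())

html = """<table><tr><td>Attribute 1</td><td>Value 1</td></tr>
<tr><td>Attribute 2</td><td>Value 2</td></tr></table>"""

parser = RowCollector()
parser.feed(html)
# First cell of each row is the attribute, second is the value.
pairs = {row[0]: row[1] for row in parser.rows if len(row) >= 2}
print(pairs)  # {'Attribute 1': 'Value 1', 'Attribute 2': 'Value 2'}
```

The per-row grouping is what keeps the attribute and its value together, the same way iterating over tr selectors does in the answer above.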
Related
Beginner with Python here, I'm trying to create a script to retrieve data from a table and organize it in a dictionary.
The HTML structure looks like this:
[...previous code]
<table class="waffle" cellspacing="0" cellpadding="0">
  <tbody>
    <tr style='height:39px;'>
      <td class="s0" dir="ltr">Gold</td>
      <td class="freezebar-cell"></td>
      <td class="s1" dir="ltr">Johnny <span style="bold">M.</span></td>
    </tr>
    <tr style='height:39px;'>
      <td class="s0" dir="ltr">Silver</td>
      <td class="freezebar-cell"></td>
      <td class="s1" dir="ltr">Maria <span style="bold">R.</span></td>
    </tr>
[rest of the code...]
My current script looks like this:
from bs4 import BeautifulSoup

itemTypeList = []  # Create list of item types
itemContentList = []  # Create list of item contents

soup = BeautifulSoup(open("test/myfile.html"), "lxml")  # Open the file
table_body = soup.find("tbody")  # Find the table
rows = table_body.find_all("tr")  # Find the rows

for row in rows:  # For each row
    itemType = row.find_all("td")[0].text  # Define the first cell as item type
    itemContent = row.find_all("td")[2]  # Define the third cell as item content
    itemTypeList.append(itemType)  # Add item type to the item types list
    itemContentList.append(itemContent)  # Add item content to the item contents list

mailContent = {itemTypeList[i]: itemContentList[i] for i in range(len(itemTypeList))}  # Create a dictionary with type and content for each item
Here's what I get with this script :
{'Gold': <td class="s1" dir="ltr">Johnny <span style="bold">M.</span></td>, 'Silver': <td class="s1" dir="ltr">Maria <span style="bold">R.</span></td>}
I'd like to remove the <td></td> tag around my itemContent items but I can't use ".text" like I do on itemType because I need to keep the <span style="bold"> tag to reuse it later in my code.
What's the best workaround to do this? I've been searching for the past three hours with no luck. Apparently .unwrap() could be useful there but when I add it to my code I get an error.
Thanks for reading this far!
Julien
You can get the innerHTML using element.decode_contents().
for row in rows:  # For each row
    itemType = row.find_all("td")[0].text  # Define the first cell as item type
    itemContent = row.find_all("td")[2]  # Define the third cell as item content
    itemTypeList.append(itemType)  # Add item type to the item types list
    itemContentList.append(itemContent.decode_contents())  # Add the cell's innerHTML to the item contents list

mailContent = {itemTypeList[i]: itemContentList[i] for i in range(len(itemTypeList))}  # Create a dictionary with type and content for each item
Output:
{'Gold': 'Johnny <span style="bold">M.</span>', 'Silver': 'Maria <span style="bold">R.</span>'}
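For a quick sanity check of what "innerHTML" means here without BeautifulSoup installed, a regex can strip the outer tag from one known-shape snippet. This is a rough stand-in for decode_contents(), valid only for this simple, hand-picked string; real HTML needs a parser:

```python
import re

# One cell copied from the question's markup.
td = '<td class="s1" dir="ltr">Johnny <span style="bold">M.</span></td>'

# Remove the opening <td ...> at the start and the closing </td> at the end,
# leaving the inner markup (including the <span>) intact.
inner = re.sub(r'^<td[^>]*>|</td>$', '', td)
print(inner)  # Johnny <span style="bold">M.</span>
```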
You can also use row.contents:
mailContent = []
for row in rows:
    itemType = row.contents[1].text
    itemContent = row.contents[5]
    mailContent.append({
        itemType: "{} {}".format(itemContent.text, itemContent.span)
    })
print(mailContent)
I have an HTML as follows:
<table class="stocksTable" summary="株価詳細">
  <tr>
    <th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
    <td class="stoksPrice realTimChange">
      <div class="realTimChangeMod">
      </div>
    </td>
    td class="stoksPrice">191.1</td>
    <td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5(+1.33%)</span></td>
  </tr>
</table>
I tried to extract 191.1 from a line containing td class="stoksPrice">191.1</td>.
soup = BeautifulSoup(html)
res = soup.find_all('stoksPrice')
print (res)
But result is [].
How can I find it?
There seem to be two issues:
First, your usage of find_all is invalid: as written, you are searching for a tag named stoksPrice, which is wrong, as your tags are table, tr, td, div, and span. You need to change that to:
>>> res = soup.find_all(class_='stoksPrice')
to search for tags with that class.
Second, your HTML is malformed. The list with stoksPrice is:
</td>
td class="stoksPrice">191.1</td>
it should have been:
</td>
<td class="stoksPrice">191.1</td>
(Note that < before the td)
Not sure whether that was a copy-paste error into Stack Overflow or the HTML is malformed at the source, but that is not going to be easy to parse ...
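What a parser actually does with that missing < can be demonstrated with the standard library's html.parser on a minimal fragment: the broken cell is swallowed as plain character data, which is exactly why no search for a td will find it:

```python
from html.parser import HTMLParser

class TagLogger(HTMLParser):
    """Record every start tag and all character data the parser sees."""
    def __init__(self):
        super().__init__()
        self.starttags = []
        self.text = ""

    def handle_starttag(self, tag, attrs):
        self.starttags.append(tag)

    def handle_data(self, data):
        self.text += data

p = TagLogger()
p.feed('<tr>td class="stoksPrice">191.1</td></tr>')
print(p.starttags)        # ['tr'] -- the broken td never registers as a tag
print("191.1" in p.text)  # True  -- it is just character data
```

Lenient parsers like lxml's may repair such markup differently, so results can vary by parser; the point is only that the "tag" is not a tag anymore.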
Since there are multiple tags having the same class, you can use CSS selectors to get an exact match.
html = '''<table class="stocksTable" summary="株価詳細">
<tr>
<th class="symbol"><h1>(株)みずほフィナンシャルグループ</h1></th>
<td class="stoksPrice realTimChange">
<div class="realTimChangeMod">
</div>
</td>
<td class="stoksPrice">191.1</td>
<td class="change"><span class="yjSt">前日比</span><span class="icoUpGreen yjMSt">+2.5(+1.33%)</span></td>
</tr>
</table>'''
soup = BeautifulSoup(html, 'lxml')
print(soup.select_one('td[class="stoksPrice"]').text)
# 191.1
Or you could use a lambda with find to get the same result.
print(soup.find(lambda t: t.name == 'td' and t['class'] == ['stoksPrice']).text)
# 191.1
Note: BeautifulSoup converts multi-valued class attributes into lists. So, the classes of the two td tags look like ['stoksPrice'] and ['stoksPrice', 'realTimChange'].
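The difference between a token match and an exact whole-list match can be seen with plain lists mimicking that representation; the lists below are mock-ups of the two tags' class attributes, not real Tag objects:

```python
# How BeautifulSoup stores the class attribute of each td: a list of tokens.
tds = [
    ["stoksPrice", "realTimChange"],  # first td
    ["stoksPrice"],                   # second td
]

# class_="stoksPrice" matches any tag whose class list *contains* the token ...
token_matches = [c for c in tds if "stoksPrice" in c]
# ... while the lambda compares the whole list, so only the exact match passes.
exact_matches = [c for c in tds if c == ["stoksPrice"]]

print(len(token_matches), len(exact_matches))  # 2 1
```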
Here is one way to do it using findAll.
Because all the other stoksPrice cells are empty, the only one whose text parses as a number is the one with the price.
You can put in a check using a try/except clause to test whether the text is a floating-point number: if it is not, the loop continues to the next cell; if it is, the value is printed.
res = soup.findAll("td", {"class": "stoksPrice"})
for r in res:
    try:
        t = float(r.text)
        print(t)
    except ValueError:
        pass
191.1
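The filtering step itself can be checked on plain strings (the cell texts below are made up); catching ValueError specifically, rather than a bare except, avoids hiding unrelated errors:

```python
def first_price(texts):
    """Return the first text that parses as a float, or None if none does."""
    for t in texts:
        try:
            return float(t)
        except ValueError:
            continue  # empty or non-numeric cell text
    return None

cells = ["", "\n", "+2.5(+1.33%)", "191.1"]
print(first_price(cells))  # 191.1
```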
<tr bgcolor="#FFFFFF">
  <td class="tablecontent" scope="row" rowspan="1">
    ZURICH AMERICAN INSURANCE COMPANY
  </td>
  <td class="tablecontent" scope="row" rowspan="1">
    FARMERS GROUP INC (14523)
  </td>
  <td class="tablecontent" scope="row">
    znaf
  </td>
  <td class="tablecontent" scope="row">
    anhm
  </td>
</tr>
I have an HTML document that contains multiple tr tags. I want to extract the href link from the first td and the data from the third td onwards under every tr tag. How can this be achieved?
You can find all tr elements, iterate over them, then do the context-specific searches for the inner td elements and get the first and the third:
for tr in soup.find_all('tr'):
    cells = tr.find_all('td')
    if len(cells) < 3:
        continue  # safety pillow
    link = cells[0].a['href']  # assuming every first td has an "a" element
    data = cells[2].get_text()
    print(link, data)
As a side note, and depending on what you are trying to accomplish with the HTML parsing, I usually find pandas.read_html() a great and convenient way to parse HTML tables into dataframes, which are quite convenient data structures to work with afterwards.
You can use css selector nth-of-type to navigate through the tds
Here's a sample:
soup = BeautifulSoup(html, 'html.parser')
a = soup.select('td:nth-of-type(1) a')[0]
href = a['href']
td = soup.select("td:nth-of-type(3)")[0]
text = td.get_text(strip=True)
This is the Web page Source code which I am scraping using Beautiful Soup.
<tr>
  <td>
    1
  </td>
  <td style="cipher1">
    <img class="cipher2" src="http://cipher3.png" alt="cipher4" title="cipher5" />
    <span class="cipher8">t</span>cipher9
  </td>
  <td>
    112
  </td>
  <td>
    3510
  </td>
</tr>
<!-- Pattern repeated -->
<tr>
  <td>
    2
  </td>
  <td style="cipher1">
I wrote some code using BeautifulSoup but I am getting more results than I want due to multiple occurrences of the pattern.
I have used
row1 = soup.find_all('a', class_="cipher7")
for row in row1:
    f.write(row['title'] + "\n")
But with this I get multiple occurrences of 'cipher7', since it occurs many times in the web page.
So I can use this
<td style="cipher1">...
since it is unique to the things I want.
So, how do I modify my code to do this?
You can use a convenient select method which takes a CSS selector as an argument:
row = soup.select("td[style=cipher1] > a.cipher7")
You can first find the td tag (since you said it is unique) and then find the specified a tag inside it.
all_as = []
rows = soup.find_all('td', {'style': 'cipher1'})
for row in rows:
    all_as.append(row.find_all('a', class_="cipher7"))
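Note that find_all returns a list per cell, so appending its results makes all_as a list of lists. If a flat list of tags is wanted instead, extend keeps it flat; the strings below stand in for Tag objects:

```python
# Hypothetical per-td results of find_all(): one list per matching td.
per_td_results = [["a1"], [], ["a2", "a3"]]

all_as = []
for found in per_td_results:
    all_as.extend(found)  # extend, not append, keeps the result flat

print(all_as)  # ['a1', 'a2', 'a3']
```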
I need to parse html table of the following structure:
<table class="table1" width="620" cellspacing="0" cellpadding="0" border="0">
  <tbody>
    <tr width="620">
      <th width="620">Smth1</th>
      ...
    </tr>
    <tr bgcolor="ffffff" width="620">
      <td width="620">Smth2</td>
      ...
    </tr>
    <tr bgcolor="E4E4E4" width="620">
      <td width="620">Smth3</td>
      ...
    </tr>
    <tr bgcolor="ffffff" width="620">
      <td width="620">Smth4</td>
      ...
    </tr>
  </tbody>
</table>
Python code:
r = requests.post(url, data)
html = lxml.html.document_fromstring(r.text)
rows = html.xpath(xpath1)[0].findall("tr")
# Getting xpath1 with FireBug
data = list()
for row in rows:
    data.append([c.text for c in row.getchildren()])
But I get this on the third line:
IndexError: list index out of range
The task is to form python dict from this. Number of rows could be different.
UPD.
Changed the way I'm getting the HTML to avoid possible problems with the requests lib. Now it's a simple url:
html = lxml.html.parse(test_url)
This proves everything is OK with the html:
lxml.html.open_in_browser(html)
But still the same problem:
rows = html.xpath(xpath1)[0].findall('tr')
data = list()
for row in rows:
    data.append([c.text for c in row.getchildren()])
Here is the xpath1:
'/html/body/table/tbody/tr[5]/td/table/tbody/tr/td[2]/table/tbody/tr/td/center/table'
UPD2. It was found experimentally that the XPath stops matching at:
xpath1 = '/html/body/table/tbody'
print html.xpath(xpath1)
#print returns []
If xpath1 is shorter, it seems to work well and returns [<Element table at 0x2cbadb0>] for xpath1 = '/html/body/table'
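The experiment in UPD2, shortening the path until it matches, can be automated. A stdlib-only sketch (xml.etree stands in for lxml here, and the document below is a toy example): walk the path one tag at a time and report the first step that stops matching. If the browser-derived path contains tbody but the parsed tree does not, that is the step where it breaks:

```python
import xml.etree.ElementTree as ET

# Toy document: no tbody, although a browser inspector would show one.
doc = ET.fromstring("<html><body><table><tr><td>x</td></tr></table></body></html>")

def first_failing_step(root, steps):
    """Follow the path tag by tag; return the first tag that fails to match."""
    node = root
    for tag in steps:
        nxt = node.find(tag)
        if nxt is None:
            return tag
        node = nxt
    return None

print(first_failing_step(doc, ["body", "table", "tbody", "tr"]))  # tbody
```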
You didn't include the XPath, so I'm not sure what you're trying to do, but if I understood correctly, this should work:
xpath1 = "tbody/tr"
r = requests.post(url, data)
html = lxml.html.fromstring(r.text)
rows = html.xpath(xpath1)
data = list()
for row in rows:
    data.append([c.text for c in row.getchildren()])
This makes a list of one-item lists, though, like this:
[['Smth1'], ['Smth2'], ['Smth3'], ['Smth4']]
To get a simple flat list of the values, you can use this code:
xpath1 = "tbody/tr/*/text()"
r = requests.post(url,data)
html = lxml.html.fromstring(r.text)
data = html.xpath(xpath1)
This is all assuming that r.text is exactly what you posted up there.
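The relative-path idea can be verified without requests or lxml, using the standard library's xml.etree on a trimmed version of the question's table; the '*' step matches both th and td children:

```python
import xml.etree.ElementTree as ET

table = ET.fromstring(
    "<table><tbody>"
    "<tr><th>Smth1</th></tr>"
    "<tr><td>Smth2</td></tr>"
    "<tr><td>Smth3</td></tr>"
    "</tbody></table>"
)

# Relative path from the table element itself, no absolute /html/body prefix.
data = [cell.text for cell in table.findall("tbody/tr/*")]
print(data)  # ['Smth1', 'Smth2', 'Smth3']
```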
Your .xpath(xpath1) XPath expression failed to find any elements. Check that expression for errors.