Remove text from first cell of an HTML table using Python

I have this file:
<table>
<tr>
<td WIDTH="49%">
<p> cell to remove</p></td>
<td WIDTH="51%"> some text </td>
</tr>
I need this as the result:
<table>
<tr>
<td>
</td>
<td WIDTH="51%"> some text </td>
</tr>
I am trying to read the file with this HTML and replace my first tag with an empty one:
ret = open('rec1.txt').read()
re.sub('<td[^/td>]+>','<td> </td>',ret, 1)
final= open('rec2.txt', 'w')
final.write(ret)
final.close()
As you can see I am new to Python. When I read rec2.txt it contains exactly the same text as the previous file.
Thanks.

Using regex to parse HTML is a very bad practice (see @Lutz Horn's link in the comments).
Use an HTML parser instead. For example, here's how you can set the value of the first td tag to empty using BeautifulSoup:
Beautiful Soup is a Python library for pulling data out of HTML and
XML files. It works with your favorite parser to provide idiomatic
ways of navigating, searching, and modifying the parse tree. It
commonly saves programmers hours or days of work.
from bs4 import BeautifulSoup
data = """
<table>
<tr>
<td WIDTH="49%">
<p> cell to remove</p>
</td>
<td WIDTH="51%">
some text
</td>
</tr>
</table>"""
soup = BeautifulSoup(data, 'html.parser')
cell = soup.table.tr.td
cell.string = ''
cell.attrs = {}
print soup.prettify(formatter='html')
prints:
<table>
<tr>
<td>
</td>
<td width="51%">
some text
</td>
</tr>
</table>
See also:
Parsing HTML in Python
Parsing HTML using Python
Hope that helps.

Using regex to parse HTML is a very bad practice. If you are actually trying to modify HTML, use an HTML parser.
If the question is academic, or you are only trying to make the limited transformation you describe in the question, here is a regex program that will do it:
#!/usr/bin/python
import re
ret = open('rec1.txt').read()
ret = re.sub('<td.*?/td>','<td> </td>',ret, 1, re.DOTALL)
final= open('rec2.txt', 'w')
final.write(ret)
final.close()
Notes:
The expression [^/td>] in your original pattern means match any one character that is not /, t, d, or >; it does not match the literal sequence /td. Note instead how I used .*? to match an arbitrary string followed by /td>.
The final, optional, argument to re.sub() is a flags argument. re.DOTALL allows . to match new lines.
The ? means to perform a non-greedy search, so it will only consume one cell.
re.sub() returns the resulting string, it does not modify the string in place.
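The last note is the bug in the original script: the return value of re.sub() was thrown away. A tiny sketch (the input string is a made-up example mirroring the question's file):

```python
import re

s = '<td WIDTH="49%"><p> cell to remove</p></td><td WIDTH="51%"> some text </td>'
# re.sub returns a new string; s itself is left unchanged
result = re.sub('<td.*?/td>', '<td> </td>', s, 1, re.DOTALL)
print(result)
```

Assigning the return value back (ret = re.sub(...)) is what makes the write to rec2.txt differ from rec1.txt.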

Related

python BeautifulSoup4 break for loop when tag found

I have a problem breaking a for loop when going through an HTML document with bs4.
I want to save a list separated by headings.
The HTML can look something like below; the real document contains more information between the desired tags:
<h2>List One</h2>
<td class="title">
<a title="Title One">This is Title One</a>
</td>
<td class="title">
<a title="Title Two">This is Title Two</a>
</td>
<h2>List Two</h2>
<td class="title">
<a title="Title Three">This is Title Three</a>
</td>
<td class="title">
<a title="Title Four">This is Title Four</a>
</td>
I would like to have the results printed like this:
List One
This is Title One
This is Title Two
List Two
This is Title Three
This is Title Four
I have come this far with my script:
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('some website')
soup = BeautifulSoup(html, "lxml")
quote1 = soup.h2
print quote1.text
quote2 = quote1.find_next_sibling('h2')
print quote2.text

for quotes in soup.findAll('h2'):
    if quotes.find(text=True) == quote2.text:
        break
    if quotes.find(text=True) == quote1.text:
        for anchor in soup.findAll('td', {'class': 'title'}):
            print anchor.text
    print quotes.text
I have tried to break the loop when "quote2" (List Two) is found, but the script gets all the td content and ignores the following h2 tags.
So how do I break the for loop at the next h2 tag?
In my opinion the problem lies in your HTML syntax. According to https://validator.w3.org it's not legal to mix "td" and "h2" (or generally any heading tag) as siblings. Also, implementing a list with tables is most likely not good practice.
If you can manipulate your input files, the list you seem to need could be implemented with "ul" and "li" tags (the first "li" in each "ul" containing the header) or, if you need to use tables, just put your header inside a "td" tag, or even more cleanly in a "th":
<table>
<tr>
<th>Your title</th>
</tr>
<tr>
<td>Your data</td>
</tr>
</table>
If the input is not under your control, your script could perform search and replace on the input text anyway, putting the headers into table cells or list items.
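If the input cannot be changed, the loop can still be broken at the next h2 by walking forward through siblings instead of re-scanning the whole soup each time. A sketch, assuming html.parser is used so the stray td elements stay siblings of the h2s (lxml's HTML parser would discard td tags outside a table):

```python
from bs4 import BeautifulSoup

html = """
<h2>List One</h2>
<td class="title"><a title="Title One">This is Title One</a></td>
<td class="title"><a title="Title Two">This is Title Two</a></td>
<h2>List Two</h2>
<td class="title"><a title="Title Three">This is Title Three</a></td>
<td class="title"><a title="Title Four">This is Title Four</a></td>
"""

soup = BeautifulSoup(html, "html.parser")
sections = []
for heading in soup.find_all("h2"):
    titles = []
    # walk forward through sibling tags until the next h2 ends this section
    for sibling in heading.find_next_siblings(True):
        if sibling.name == "h2":
            break
        if sibling.name == "td":
            titles.append(sibling.get_text(strip=True))
    sections.append((heading.get_text(), titles))

for name, titles in sections:
    print(name)
    for title in titles:
        print(title)
```

This prints each heading followed by its own titles, which matches the desired output in the question.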

How to get the text from a cell after <br/> tag?

I'm crawling through a simple, but long HTML chunk, which is similar to this:
<table>
<tbody>
<tr>
<td> Some text </td>
<td> Some text </td>
</tr>
<tr>
<td> Some text
<br/>
Some more text
</td>
</tr>
</tbody>
</table>
I'm collecting the data with following little python code (using lxml):
for element in root.iter():
    if element.tag == 'td':
        print element.text
Some of the texts are divided into two rows, but mostly they fit in a single row. The problem is with the divided ones.
The root element is the 'table' tag. The code above can print all the other texts, but not what comes after the 'br' tags. If I don't exclude non-td tags, the code tries to print text from inside the 'br' tags, but of course there is nothing there, so it just prints an empty line.
After the 'br', the iteration moves on to the next tag, ignoring the data that is still inside the previous 'td' tag.
How can I get also the data after those tags?
Edit: It seems that some of the 'br' tags are self closing, but some are left open
<td>
Some text
<br>
Some more text
</td>
The element.tail attribute, suggested in the first answer, does not seem to get the data after that open tag.
Edit2: Actually it works. Was my own mistake. Forgot to mention that the "print element.text" part was encapsulated by try-except, which in case of the br tag caught an AttributeError, because there's nothing inside the br tags. I had set the exception to just pass and print out nothing. Inside the same try-except I tried also print out the tail, but printing out the tail was never reached, because of the exception that happened before it.
Because <br/> is a self-closing tag, it does not have any text content. Instead, you need to access its tail content. The tail is the content after the element's closing tag but before the next opening tag. To access it in your for loop, use the following:
for element in root.iter():
    element_text = element.text
    element_tail = element.tail
Even if the br tag is left open, this approach still works:
from lxml import etree
content = '''
<table>
<tbody>
<tr>
<td> Some text </td>
<td> Some text </td>
</tr>
<tr>
<td> Some text
<br>
Some more text
</td>
</tr>
</tbody>
</table>
'''
root = etree.HTML(content)
for element in root.iter():
    print(element.tail)
Output (the other elements print None or whitespace tails; the relevant line is):
Some more text
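Since only the br tails matter here, the iteration can also target br elements directly; a minimal sketch on a made-up snippet:

```python
from lxml import etree

content = '<table><tr><td> Some text <br> Some more text </td></tr></table>'
root = etree.HTML(content)
# the text after an (open or self-closing) <br> is its tail, not its text
tails = [br.tail.strip() for br in root.iter('br')]
print(tails)
```

This avoids printing the None and whitespace tails of all the other elements.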
For me, the expression below works to extract the text after the br:
normalize-space(//table//br/following::text()[1])
Working example is at.
You can target the br element and use .get(index) to fetch the underlying DOM element, then use nextSibling to reach the text node. The nodeValue property can then be used to get the text.

python lxml xpath returning escape characters in list with text

Before last week, my experience with Python had been very limited to large database files on our network, and suddenly I am thrust into the world of trying to extract information from html tables.
After a lot of reading, I chose to use lxml and xpath with Python 2.7 to retrieve the data in question. I have retrieved one field using the following code:
xpath = "//table[@id='resultsTbl1']/tr[position()>1]/td[@id='row_0_partNumber']/child::text()"
which produced the following list:
['\r\n\t\tBAR18FILM/BKN', '\r\n\t\t\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t\r\n\t\t']
I recognize the CR/LF and tab escape characters; how can I avoid them?
Those characters are part of the XML document, which is why they are being returned. You can't avoid them, but you can strip them out. You could call the .strip() method on each item returned:
results = [x.strip() for x in results]
This would strip leading and trailing whitespace. Without seeing your actual code and data it's harder to give a good answer.
For example, given this script:
#!/usr/bin/python
from lxml import etree

with open('data.xml') as fd:
    doc = etree.parse(fd)

results = doc.xpath(
    "//table[@id='results']/tr[position()>1]/td/child::text()")

print 'Before stripping'
print repr(results)

print 'After stripping'
results = [x.strip() for x in results]
print repr(results)
And this data:
<doc>
<table id="results">
<tr>
<th>ID</th><th>Name</th><th>Description</th>
</tr>
<tr>
<td>
1
</td>
<td>
Bob
</td>
<td>
A person
</td>
</tr>
<tr>
<td>
2
</td>
<td>
Alice
</td>
<td>
Another person
</td>
</tr>
</table>
</doc>
We get these results:
Before stripping
['\n\t\t\t1\n\t\t\t', '\n\t\t\tBob\n\t\t\t', '\n\t\t\tA person\n\t\t\t', '\n\t\t\t2\n\t\t\t', '\n\t\t\tAlice\n\t\t\t', '\n\t\t\tAnother person\n\t\t\t']
After stripping
['1', 'Bob', 'A person', '2', 'Alice', 'Another person']
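One caveat: on the list in the question, stripping turns the whitespace-only entries into empty strings, so filtering those out as well may be closer to what is wanted:

```python
# the result list from the question
results = ['\r\n\t\tBAR18FILM/BKN', '\r\n\t\t\r\n\t\t\t', '\r\n\t\t\t',
           '\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t', '\r\n\t\t\t\r\n\t\t']
# strip each item, keeping only the ones with real content
cleaned = [x.strip() for x in results if x.strip()]
print(cleaned)
```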

Beautiful soup question

I want to fetch specific rows in an HTML document
The rows have the following attributes set: bgcolor and vallign
Here is a snippet of the HTML table:
<table>
<tbody>
<tr bgcolor="#f01234" valign="top">
<!--- td's follow ... -->
</tr>
<tr bgcolor="#c01234" valign="top">
<!--- td's follow ... -->
</tr>
</tbody>
</table>
I've had a very quick look at BS's documentation. It's not clear what params to pass to findAll to match the rows I want.
Does anyone know what to pass to findAll() to match the rows I want?
Don't use regex to parse HTML. Use an HTML parser:
import lxml.html
doc = lxml.html.fromstring(your_html)
result = doc.xpath("//tr[(@bgcolor='#f01234' or @bgcolor='#c01234') "
                   "and @valign='top']")
print result
That will extract all matching tr elements from your HTML. You can then do further operations with them: change text or attribute values, extract, search further...
Obligatory link:
RegEx match open tags except XHTML self-contained tags
Something like:
soup.findAll('tr', attrs={'bgcolor': re.compile(r'#f01234|#c01234'),
                          'valign': 'top'})
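Fleshed out into a runnable sketch on a table like the one in the question (the cell contents are invented):

```python
import re
from bs4 import BeautifulSoup

html = """
<table><tbody>
<tr bgcolor="#f01234" valign="top"><td>row 1</td></tr>
<tr bgcolor="#c01234" valign="top"><td>row 2</td></tr>
<tr bgcolor="#ffffff" valign="top"><td>row 3</td></tr>
</tbody></table>
"""

soup = BeautifulSoup(html, 'html.parser')
# a regex for the attribute value matches either bgcolor; the plain
# string for valign must match exactly
rows = soup.findAll('tr', attrs={'bgcolor': re.compile(r'#f01234|#c01234'),
                                 'valign': 'top'})
print([row.td.text for row in rows])
```

The third row is excluded because its bgcolor matches neither alternative.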

Can't parse a table child with xpath

I'm parsing a site with some messy HTML. There are 130 subsites, and the only one that fails is the last one. The part that fails is the line table = doc.xpath("descendant::table[1]")[0]: I get an empty list when I should be getting 3 tables (the parent and 2 children). All the sites have the same structure, so I don't have a clue how to solve this.
from lxml.html import parse

# get a list of the urls of the foods to parse
main_site = "http://www.whfoods.com/foodstoc.php"
doc = parse(main_site).getroot()
doc.make_links_absolute()
sites = doc.xpath('/html/body//div[@class="full3col"]/ul/li/a/@href')

for site in sites:
    doc = parse(site).getroot()
    table = doc.xpath("descendant::table[1]")[0]
    # food info list
    table.xpath("//tr/td/table/tr/td/b/text()")
    # food nutrients list
    table.xpath("//tr/td/table[1]/tr/td/text()")
This is an HTML excerpt of the site that fails:
<html>
<head>
<body>
<div id="mainpage">
<div id="subcontent">
(40+ <p> tags with things inside)
<p>
<table>
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<b>Food's name<br>other things</b>
</td>
</tr>
<tr>
Heads of the table (not needed)
</tr>
<tr>
<td>nutrient name</td>
<td>dv</td>
<td>density</td>
<td>rating</td>
</tr>
</tbody>
</table>
<table> Not needed
...
All remaining closing tags
According to validator.w3.org when pointed at http://www.whfoods.com/genpage.php?tname=foodspice&dbid=97:
Line 253, column 147: non SGML character number 150
…ed mushrooms by Liquid Chromatography Mass Spectroscopy. The 230th ACS Natio…
The problem character is between "Chromatography" and "Mass". The page is declared to be encoded in ISO-8859-1, but as often happens in that case, it is lying:
>>> import unicodedata as ucd
>>> ucd.name(chr(150).decode('cp1252'))
'EN DASH'
Perhaps lxml is being picky about this also (Firefox doesn't care).
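One workaround sketch, if the parser balks at the page: decode the bytes as windows-1252 yourself before parsing (the byte string here just reproduces the problem character from the validator report):

```python
import unicodedata

raw = b'Chromatography\x96Mass'     # byte 150 sits between the two words
text = raw.decode('cp1252')         # decode as windows-1252, not ISO-8859-1
print(unicodedata.name(text[14]))   # the character byte 150 actually maps to
```

The decoded unicode string can then be fed to lxml.html.fromstring() instead of the raw bytes.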