I want to fetch specific rows in an HTML document.
The rows have the following attributes set: bgcolor and valign.
Here is a snippet of the HTML table:
<table>
<tbody>
<tr bgcolor="#f01234" valign="top">
<!--- td's follow ... -->
</tr>
<tr bgcolor="#c01234" valign="top">
<!--- td's follow ... -->
</tr>
</tbody>
</table>
I've had a very quick look at BS's documentation. It's not clear what parameters to pass to findAll to match the rows I want.
Does anyone know what to pass to findAll() to match the rows I want?
Don't use regex to parse HTML. Use an HTML parser:
import lxml.html
doc = lxml.html.fromstring(your_html)
# select <tr> elements by attribute value with XPath
result = doc.xpath("//tr[(@bgcolor='#f01234' or @bgcolor='#c01234') "
                   "and @valign='top']")
print(result)
That extracts all matching tr elements from your HTML. You can then work with them further: change text or attribute values, extract data, search deeper, and so on.
Obligatory link:
RegEx match open tags except XHTML self-contained tags
Something like:
import re
# the compiled pattern matches either colour; valign must equal 'top'
soup.findAll('tr', attrs={'bgcolor': re.compile(r'#f01234|#c01234'),
                          'valign': 'top'})
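For a fragment that happens to be well-formed XML, like the snippet in the question, the same attribute filter can also be sketched with only the standard library. This swaps in xml.etree.ElementTree for lxml/BeautifulSoup and assumes the markup parses as XML (real-world HTML often does not). The third row and the variable names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: the fragment is well-formed XML. The third row is an
# invented non-matching row, added only to show the filter working.
snippet = """
<table>
<tbody>
<tr bgcolor="#f01234" valign="top"><td>a</td></tr>
<tr bgcolor="#c01234" valign="top"><td>b</td></tr>
<tr bgcolor="#ffffff" valign="top"><td>c</td></tr>
</tbody>
</table>
"""

table = ET.fromstring(snippet)
wanted_colors = {"#f01234", "#c01234"}

# Keep only the rows whose bgcolor is one of the wanted values
# and whose valign is exactly "top".
rows = [
    tr for tr in table.iter("tr")
    if tr.get("bgcolor") in wanted_colors and tr.get("valign") == "top"
]

for tr in rows:
    print(tr.get("bgcolor"), tr.findtext("td"))
```

For messy real-world pages, the lxml or BeautifulSoup versions above are the safer choice, since they tolerate markup that is not valid XML.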
I'm learning Scrapy and trying to use it to get some specific topics in a forum.
In the forum the information I need is stored like:
<tbody id="threadnumber">
<tr>
<th class="new">
<em>[topic]</em>
postname
</th>
<td class="by">
<a something to show the poster and time>**</a>
</td>
<td class="num">
<a something to show the numbers of read and replys>**</a>
</td>
<td class="by">
<a something to show the last replyer and time>**</a>
</td>
</tr>
</tbody>
<tbody id="threadnumber">#next thread
<tr>....
</tr>
</tbody>
Is there a way to get the postname in the second a tag for a specific topic whose unique topicid is stored in the first a tag? Should I use siblings?
For example I get
[NEWS]
news1
[NEWS]
news2
[NEWS]
news3
[PIC]
picture1
for input.
And I want the output to include only the "NEWS" topic, like ['news1','news2','news3'].
Thanks for your help!
You can use BeautifulSoup to find all tags with class="post". Then, for each tag, you search for an <a> tag among the descendants of its parent and test whether its text is the topic you are interested in. If it is, you add the postname to a result list. The code could be:
def findposts(soup, topic):
    '''Finds all postnames associated with topic in a BeautifulSoup element'''
    posts = []  # initialize an empty result list
    # search postnames by class
    for postname in soup.findAll('a', attrs={'class': 'post'}):
        # find associated topic in immediate parent
        if postname.findParent().find('a').text == topic:
            posts.append(postname.text)  # OK, add to result list
    return posts
With your example data, you could do:
from bs4 import BeautifulSoup
soup = BeautifulSoup('data', 'html.parser')
print(findposts(soup, 'topic'))
and the result would be as expected:
['postname']
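The function above relies on a class="post" attribute that does not appear in the markup shown in the question. As a rough sketch against the structure that is shown there (a th with class="new" holding the topic in an em tag, followed by the postname), here is a single-pass version. It swaps in the standard library's html.parser for BeautifulSoup. The class name TopicPosts and the sample threads are made up for illustration, and it assumes the topic is always written in square brackets.

```python
from html.parser import HTMLParser


class TopicPosts(HTMLParser):
    """Collect postnames whose bracketed topic equals the wanted topic."""

    def __init__(self, topic):
        super().__init__()
        self.topic = topic
        self.in_th = False          # inside <th class="new">
        self.in_em = False          # inside the <em> holding the topic
        self.current_topic = None   # topic of the thread being read
        self.posts = []

    def handle_starttag(self, tag, attrs):
        if tag == "th" and dict(attrs).get("class") == "new":
            self.in_th = True
            self.current_topic = None
        elif tag == "em" and self.in_th:
            self.in_em = True

    def handle_endtag(self, tag):
        if tag == "em":
            self.in_em = False
        elif tag == "th":
            self.in_th = False

    def handle_data(self, data):
        text = data.strip()
        if not text or not self.in_th:
            return
        if self.in_em:
            # "[NEWS]" -> "NEWS"
            self.current_topic = text.strip("[]")
        elif self.current_topic == self.topic:
            self.posts.append(text)


# Invented sample threads following the structure from the question.
sample = """
<tbody><tr><th class="new"><em>[NEWS]</em> news1</th></tr></tbody>
<tbody><tr><th class="new"><em>[PIC]</em> picture1</th></tr></tbody>
<tbody><tr><th class="new"><em>[NEWS]</em> news2</th></tr></tbody>
"""

parser = TopicPosts("NEWS")
parser.feed(sample)
parser.close()
print(parser.posts)
```

The same idea carries over to BeautifulSoup or Scrapy selectors: find each th with class "new", read the topic from its em child, and keep the remaining text only when the topic matches.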
I have a problem breaking out of a for loop when going through HTML with bs4.
I want to save a list grouped by headings.
The HTML can look something like the code below, although it contains more information between the desired tags:
<h2>List One</h2>
<td class="title">
<a title="Title One">This is Title One</a>
</td>
<td class="title">
<a title="Title Two">This is Title Two</a>
</td>
<h2>List Two</h2>
<td class="title">
<a title="Title Three">This is Title Three</a>
</td>
<td class="title">
<a title="Title Four">This is Title Four</a>
</td>
I would like to have the results printed like this:
List One
This is Title One
This is Title Two
List Two
This is Title Three
This is Title Four
I have come this far with my script:
import urllib2
from bs4 import BeautifulSoup
html = urllib2.urlopen('some website')
soup = BeautifulSoup(html, "lxml")
quote1 = soup.h2
print quote1.text
quote2 = quote1.find_next_sibling('h2')
print quote2.text
for quotes in soup.findAll('h2'):
    if quotes.find(text=True) == quote2.text:
        break
    if quotes.find(text=True) == quote1.text:
        for anchor in soup.findAll('td', {'class':'title'}):
            print anchor.text
        print quotes.text
I have tried to break the loop when "quote2" (List Two) is found, but the script prints all the td content and ignores the following h2 tags.
So how do I break the for loop at the next h2 tag?
In my opinion the problem lies in your HTML syntax. According to https://validator.w3.org it's not legal to mix "td" and "h2" (or generally any header tag) this way. Also, implementing lists with tables is most likely not good practice.
If you can manipulate your input files, the list you seem to need could be implemented with "ul" and "li" tags (the first "li" in each "ul" containing the header) or, if you need to use tables, just put your header inside a "td" tag, or even more cleanly in a "th":
<table>
<tr>
<th>Your title</th>
</tr>
<tr>
<td>Your data</td>
</tr>
</table>
If the input is not under your control, your script could perform search and replace on the input text anyway, putting the headers into table cells or list items.
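Whatever is done about the markup, the grouping asked for in the question can also be sketched as one pass over the document in order: remember the most recent h2 and attach each td.title text to it. The sketch below swaps in the standard library's html.parser for bs4, since it reports tags in source order without trying to repair the stray td elements. The class name HeadingGroups is hypothetical.

```python
from html.parser import HTMLParser


class HeadingGroups(HTMLParser):
    """Group the text of <td class="title"> cells under the preceding <h2>."""

    def __init__(self):
        super().__init__()
        self.groups = []     # list of (heading, [titles]) in document order
        self.target = None   # "h2", "title", or None

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.target = "h2"
        elif tag == "td" and dict(attrs).get("class") == "title":
            self.target = "title"

    def handle_endtag(self, tag):
        if tag in ("h2", "td"):
            self.target = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.target == "h2":
            # a new heading starts a new group
            self.groups.append((text, []))
        elif self.target == "title" and self.groups:
            # a title belongs to the most recent heading
            self.groups[-1][1].append(text)


page = """
<h2>List One</h2>
<td class="title">
<a title="Title One">This is Title One</a>
</td>
<td class="title">
<a title="Title Two">This is Title Two</a>
</td>
<h2>List Two</h2>
<td class="title">
<a title="Title Three">This is Title Three</a>
</td>
<td class="title">
<a title="Title Four">This is Title Four</a>
</td>
"""

parser = HeadingGroups()
parser.feed(page)
parser.close()

for heading, titles in parser.groups:
    print(heading)
    for title in titles:
        print(title)
```

Because each title is attached to the latest heading seen, there is no need to break out of a loop at the next h2.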
I'm crawling through a simple, but long HTML chunk, which is similar to this:
<table>
<tbody>
<tr>
<td> Some text </td>
<td> Some text </td>
</tr>
<tr>
<td> Some text
<br/>
Some more text
</td>
</tr>
</tbody>
</table>
I'm collecting the data with the following little Python code (using lxml):
for element in root.iter():
    if element.tag == 'td':
        print element.text
Some of the texts are divided into two rows, but mostly they fit in a single row. The problem is within the divided rows.
The root element is the 'table' tag. That little code can print out all the other texts, but not what comes after the 'br' tags. If I don't exclude non-td tags, the code tries to print possible text from inside the 'br' tags, but of course there's nothing in there and thus this prints just empty new line.
However after this 'br', the code moves to the next tag on the line within the iteration, but ignores that data that's still inside the previous 'td' tag.
How can I get also the data after those tags?
Edit: It seems that some of the 'br' tags are self-closing, but some are left open:
<td>
Some text
<br>
Some more text
</td>
The element.tail attribute, suggested in the first answer, does not seem to be able to get the data after that open tag.
Edit2: Actually it works. It was my own mistake. I forgot to mention that the "print element.text" part was wrapped in a try-except, which in the case of the br tag caught an AttributeError, because there's nothing inside the br tags. I had set the exception handler to just pass and print nothing. Inside the same try-except I also tried to print the tail, but that line was never reached because of the exception raised before it.
Because <br/> is a self-closing tag, it does not have any text content. Instead, you need to access its tail content. The tail is the text that follows the element's closing tag, up to the next tag. To access this content in your for loop you will need to use the following:
for element in root.iter():
    element_text = element.text
    element_tail = element.tail
Even if the br tag is left unclosed (<br>), this method will still work:
from lxml import etree
content = '''
<table>
<tbody>
<tr>
<td> Some text </td>
<td> Some text </td>
</tr>
<tr>
<td> Some text
<br>
Some more text
</td>
</tr>
</tbody>
</table>
'''
root = etree.HTML(content)
for element in root.iter():
    print(element.tail)
Output (ignoring the None and whitespace-only tails of the other elements):
Some more text
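The text/tail split is easier to see on a single cell. The sketch below uses the standard library's xml.etree.ElementTree, which exposes the same text and tail attributes that lxml follows. It assumes the cell is well-formed XML, so the br is written as <br/>. The variable names are only for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: a well-formed cell, so <br/> is self-closed.
cell = ET.fromstring("<td> Some text <br/> Some more text </td>")
br = cell.find("br")

print(repr(cell.text))  # text before the first child element
print(repr(br.text))    # the <br/> itself has no text
print(repr(br.tail))    # text after <br/>, still inside the <td>

# itertext() walks text and tail values in document order,
# so it returns both halves of the cell.
pieces = [t.strip() for t in cell.itertext() if t.strip()]
print(pieces)
```

So to print everything inside a td, either read the tail of each child as well as the text of the td, or join the result of itertext().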
For me, the following XPath expression works to extract the text after the br:
normalize-space(//table//br/following::text()[1])
You can target the br element and use .get(index) to fetch the underlying DOM element, then use nextSibling to target the text node. The nodeValue property can then be used to get the text.
I have this file:
<table>
<tr>
<td WIDTH="49%">
<p> cell to remove</p></td>
<td WIDTH="51%"> some text </td>
</tr>
I need as result this:
<table>
<tr>
<td>
</td>
<td WIDTH="51%"> some text </td>
</tr>
I am trying to read the file with this html and replace my first tag with an empty one:
ret = open('rec1.txt').read()
re.sub('<td[^/td>]+>','<td> </td>',ret, 1)
final= open('rec2.txt', 'w')
final.write(ret)
final.close()
As you can see, I am new to Python. When I read rec2.txt, it contains exactly the same text as the original file.
Thanks
Using regex to parse HTML is a very bad practice (see @Lutz Horn's link in the comment).
Use an HTML parser instead. For example, here's how you can set the value of the first td tag to empty using BeautifulSoup:
Beautiful Soup is a Python library for pulling data out of HTML and
XML files. It works with your favorite parser to provide idiomatic
ways of navigating, searching, and modifying the parse tree. It
commonly saves programmers hours or days of work.
from bs4 import BeautifulSoup
data = """
<table>
<tr>
<td WIDTH="49%">
<p> cell to remove</p>
</td>
<td WIDTH="51%">
some text
</td>
</tr>
</table>"""
soup = BeautifulSoup(data, 'html.parser')
cell = soup.table.tr.td
cell.string = ''
cell.attrs = {}
print(soup.prettify(formatter='html'))
prints:
<table>
<tr>
<td>
</td>
<td width="51%">
some text
</td>
</tr>
</table>
See also:
Parsing HTML in Python
Parsing HTML using Python
Hope that helps.
Using regex to parse HTML is a very bad practice. If you are actually trying to modify HTML, use an HTML parser.
If the question is academic, or you are only trying to make the limited transformation you describe in the question, here is a regex program that will do it:
#!/usr/bin/python
import re
ret = open('rec1.txt').read()
# count and flags are passed by keyword; re.DOTALL lets . match newlines
ret = re.sub('<td.*?/td>', '<td> </td>', ret, count=1, flags=re.DOTALL)
final= open('rec2.txt', 'w')
final.write(ret)
final.close()
Notes:
The expression [/td] is a character class: it matches any single one of /, t, or d, not the string /td. Likewise, [^/td>] matches any single character that is none of /, t, d, or >. Note instead how I used .* to match an arbitrary string followed by /td.
The final, optional, argument to re.sub() is a flags argument. re.DOTALL allows . to match newlines.
The ? means to perform a non-greedy search, so it will only consume one cell.
re.sub() returns the resulting string; it does not modify the string in place.
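The notes above can be checked on a small string. This is only a sketch using the standard re module. The sample markup mirrors the file from the question, and the variable names are made up.

```python
import re

html = """<tr>
<td WIDTH="49%">
<p> cell to remove</p></td>
<td WIDTH="51%"> some text </td>
</tr>"""

# Non-greedy with DOTALL: stops at the first "/td>", so only the
# first cell is replaced.
non_greedy = re.sub(r"<td.*?/td>", "<td> </td>", html,
                    count=1, flags=re.DOTALL)

# Greedy with DOTALL: runs to the last "/td>", swallowing both cells.
greedy = re.sub(r"<td.*/td>", "<td> </td>", html,
                count=1, flags=re.DOTALL)

# Without DOTALL, "." cannot cross the newline inside the first cell,
# so the first match is the second cell, which fits on one line.
no_dotall = re.sub(r"<td.*?/td>", "<td> </td>", html, count=1)

print(non_greedy)
print(greedy)
print(no_dotall)

# re.sub() returned new strings; the original is unchanged.
print("cell to remove" in html)
```

The last line is the reason rec2.txt came out identical in the question: the return value of re.sub() was never assigned.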
I have many pages of HTML with various sections containing these code snippets:
<div class="footnote" id="footnote-1">
<h3>Reference:</h3>
<table cellpadding="0" cellspacing="0" class="floater" style="margin-bottom:0;" width="100%">
<tr>
<td valign="top" width="20px">
1.
</td>
<td>
<p> blah </p>
</td>
</tr>
</table>
</div>
I can parse the HTML successfully and extract these relevant tags
tags = soup.find_all(attrs={"footnote"})
Now I need to add new parent tags around these, so that the code snippet becomes:
<div class="footnote-out"><CODE></div>
But I can't find a way of adding parent tags in bs4 so that they enclose the identified tags. insert()/insert_before() don't enclose the identified tags.
I started by trying string manipulation:
for tags in soup.find_all(attrs={"footnote"}):
    tags = BeautifulSoup("""<div class="footnote-out">"""+str(tags)+("</div>"))
but I believe this isn't the best course.
Thanks for any help. Just started using bs/bs4 but can't seem to crack this.
How about this:
def wrap(to_wrap, wrap_in):
    # replace_with() puts wrap_in into the tree and returns the removed element
    contents = to_wrap.replace_with(wrap_in)
    wrap_in.append(contents)
Simple example:
from bs4 import BeautifulSoup
soup = BeautifulSoup("<body><a>Some text</a></body>", "html.parser")
wrap(soup.a, soup.new_tag("b"))
print(soup.body)
# <body><b><a>Some text</a></b></body>
Example with your document:
for footnote in soup.find_all("div", "footnote"):
    new_tag = soup.new_tag("div")
    new_tag['class'] = 'footnote-out'
    wrap(footnote, new_tag)