If I have a string that looks something like...
"<tr><td>123</td><td>234</td>...<td>697</td></tr>"
Basically a table row with n cells.
What's the easiest way in Python to get the value of each cell? That is, I just want the values "123", "234", and "697" stored in a list, array, or whatever is easiest.
I've tried regular expressions. When I use re.match I can't get it to find anything, and with re.search I can only get the first cell. But I want all of the cells. If that's not possible with n cells, how would you do it with a fixed number of cells?
If that markup is part of a larger set of markup, you should prefer a tool with an HTML parser.
One such tool is BeautifulSoup.
Here's one way to find what you need using that tool:
>>> markup = '''<tr><td>123</td><td>234</td>...<td>697</td></tr>'''
>>> from bs4 import BeautifulSoup as bs
>>> soup = bs(markup, 'html.parser')
>>> for i in soup.find_all('td'):
...     print(i.text)
Result:
123
234
697
Don't do this with regular expressions. Just use a proper HTML parser, and use something like XPath to get the elements you want.
A lot of people like lxml. For this task you will probably want to use its BeautifulSoup backend, or use BeautifulSoup directly, because this is presumably not markup from a source known to generate well-formed, valid documents.
When using lxml, an element tree gets created. Each element in the element tree holds information about a tag.
from lxml import etree

root = etree.XML("<root><a x='123'>aText<b/><c/><b/></a></root>")
elements = root.findall(".//a")  # every <a> element anywhere below the root
tag = elements[0].tag            # 'a'
attr = elements[0].attrib        # {'x': '123'}
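Applied to the row from the question, a minimal sketch might look like this (assuming the fragment is well-formed enough to be parsed as XML; for messier markup, fall back to an HTML parser like BeautifulSoup as above):

from lxml import etree

# Parse the row and collect the text of every <td> cell.
row = etree.fromstring("<tr><td>123</td><td>234</td><td>697</td></tr>")
cells = [td.text for td in row.findall(".//td")]
print(cells)  # ['123', '234', '697']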
Using BeautifulSoup, I have the following line:
dimensions = SOUP.select(".specs__title > h4", text=re.compile(r'Dimensions'))
However, it's returning more than just the tags that have a text of 'Dimensions' as shown in these results:
[<h4>Dimensions</h4>, <h4>Details</h4>, <h4>Warranty / Certifications</h4>]
Am I using the regex incorrectly with the way SOUP works?
The select() interface doesn't have a text keyword. Before we go further, the following assumes you are using BeautifulSoup 4.7+.
If you'd like to filter by text, you might be able to do something like this:
dimensions = SOUP.select(".specs__title > h4:contains(Dimensions)")
More information on the :contains() pseudo-class implementation is available here: https://facelessuser.github.io/soupsieve/selectors/#:contains.
EDIT: To clarify, there is currently no way to use regex directly in a select() call. You would have to filter the elements after the fact. In the future there may be a way to use regex via some custom pseudo-class, but no such feature is currently available in Soup Sieve (Beautiful Soup's select implementation in 4.7+).
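For instance, a minimal sketch of that after-the-fact filtering (reusing the SOUP object and selector from the question) might look like:

import re

# Select the candidate tags first, then keep only those whose text
# matches the regex.
dimensions = [h4 for h4 in SOUP.select(".specs__title > h4")
              if re.search(r'Dimensions', h4.get_text())]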
I am trying to make a simple Python script to extract certain links from a webpage. I am able to extract links successfully, but now I want to extract some more information, like bitrate, size, and duration, given on that webpage.
I am using the XPath below to extract the above-mentioned info:
>>> doc = lxml.html.parse('http://mp3skull.com/mp3/linkin_park_faint.html')
>>> info = doc.xpath(".//*[@id='song_html']/div[1]/text()")
>>> info[0:7]
['\n\t\t\t', '\n\t\t\t\t3.71 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t3.49 mb\t\t\t', '\n\t\t\t', '\n\t\t\t\t192 kbps', '2:41']
Now what I need is, for a particular link, the info generated as a tuple of the form (bitrate, size, duration).
The XPath above extracts the required info, but it is badly formatted: it is mixed with tabs and newlines, and I haven't been able to massage it into my required format with any logic.
So, is there any way to achieve the output in my format?
I think BeautifulSoup will do the job; it parses even badly formatted HTML:
http://www.crummy.com/software/BeautifulSoup/
Parsing is quite easy with BeautifulSoup. For example:

import bs4
import urllib.request

soup = bs4.BeautifulSoup(
    urllib.request.urlopen('http://mp3skull.com/mp3/linkin_park_faint.html').read(),
    'html.parser')
print(soup.find_all('a'))
and it has quite good docs:
http://www.crummy.com/software/BeautifulSoup/bs4/doc/
You can actually strip everything out with XPath's translate() function:
translate(.//*[@id='song_html']/div[1]/text(), "\n\t,'", '')
So, for your additional question, either:
info[0:len(info)]
for the whole thing, or:
info.rfind(" ")
since the translate leaves a space character, but you could replace that with whatever you wanted.
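As a rough sketch, that translate() expression can also be evaluated directly from lxml (reusing the doc object from the question); an XPath expression that returns a string yields a single Python string:

# Note: translate() converts only the first text node matched by the
# inner expression to a string before substituting away characters.
cleaned = doc.xpath(
    "translate(.//*[@id='song_html']/div[1]/text(), '\n\t', '')")
print(cleaned)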
Additional info can be found here.
How are you with regular expressions and python's re module?
http://docs.python.org/library/re.html may be essential.
As far as getting the data out of the list goes, re.match(regex, info[n]) should suffice, and Python's tuple syntax takes care of the triple. Simply match against members of your info list with re.match:
import re

matching_re = '.*'  # this re matches whole strings, rather than what you need
incoming_value_1 = re.match(matching_re, info[1])
# etc.
truple = (incoming_value_1, incoming_value_2, incoming_value_3)
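Building on that, here is a hedged sketch of turning the info list shown earlier into a (bitrate, size, duration) tuple; the patterns are guesses based on the sample output above:

import re

# Strip the stray tabs and newlines first, then pick each field out by a
# guessed pattern. Assumes info looks like the list printed above.
cleaned = [s.strip() for s in info if s.strip()]
# cleaned -> ['3.71 mb', '3.49 mb', '192 kbps', '2:41']

bitrate = next((s for s in cleaned if s.endswith('kbps')), None)
size = next((s for s in cleaned if s.endswith('mb')), None)
duration = next((s for s in cleaned if re.fullmatch(r'\d+:\d{2}', s)), None)

song_info = (bitrate, size, duration)  # ('192 kbps', '3.71 mb', '2:41')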
How can one tell etree.strip_tags() to strip all possible tags from a given tag element?
Do I have to map them myself, like:
STRIP_TAGS = [ALL TAGS...] # Is there a built in list or dictionary in lxml
# that gives you all tags?
etree.strip_tags(tag, *STRIP_TAGS)
Perhaps a more elegant approach I don't know of?
Example input:
parent_tag = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
Desired Output:
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
or even better:
This is some text with multiple tags and sometimes they are nested.
You can use the lxml.html.clean module:
import lxml.html, lxml.html.clean
import lxml.etree

s = '<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'
tree = lxml.html.fromstring(s)
cleaner = lxml.html.clean.Cleaner(allow_tags=['parent'], remove_unknown_tags=False)
cleaned_tree = cleaner.clean_html(tree)
print(lxml.etree.tostring(cleaned_tree).decode())
# <parent>This is some text with multiple tags and sometimes they are nested.</parent>
This answer is a bit late, but a simpler solution than the one in the earlier answer by ars may be handy for safekeeping's sake.
Short Answer
Use the "*" argument when you call strip_tags() to specify all tags to be stripped.
Long Answer
Given your XML string, we can create an lxml Element:
>>> import lxml.etree
>>> s = "<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>"
>>> parent_tag = lxml.etree.fromstring(s)
You can inspect that instance like so:
>>> parent_tag
<Element parent at 0x5f9b70>
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some <i>text</i> with multiple <some_tag>tags</some_tag> and sometimes they <tt>are<bold> nested</bold></tt>.</parent>'
To strip out all the tags except the parent tag itself, use the etree.strip_tags() function like you suggested, but with a "*" argument:
>>> lxml.etree.strip_tags(parent_tag, "*")
Inspection shows that all child tags are gone:
>>> lxml.etree.tostring(parent_tag)
b'<parent>This is some text with multiple tags and sometimes they are nested.</parent>'
This is your desired output. Note that this modifies the lxml Element instance itself! To make it even better (as you asked :-)), just grab the text property:
>>> parent_tag.text
'This is some text with multiple tags and sometimes they are nested.'
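Note that because strip_tags() mutates the element in place, you may want to work on a copy if the original tree still matters. A small sketch using the standard library's copy module:

>>> import copy
>>> original = lxml.etree.fromstring(s)   # re-parse the original string
>>> stripped = copy.deepcopy(original)    # strip a deep copy instead
>>> lxml.etree.strip_tags(stripped, "*")
>>> stripped.text
'This is some text with multiple tags and sometimes they are nested.'

The original element is left untouched.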
I have an xml file in which it is possible that the following occurs:
...
<a><b>This is</b> some text about <c>some</c> issue I have, parsing xml</a>
...
Edit: Let's assume the tags could be nested more than one level deep, meaning
<a><b><c>...</c>...</b>...</a>
I came up with this using the python lxml.etree library.
context = etree.iterparse(PATH_TO_XML, dtd_validation=True, events=("end",))
for event, element in context:
    tag = element.tag
    if tag == "a":
        print(element.text)  # is empty :/
        mystring = element.xpath("string()")
        ...
But somehow it goes wrong.
What I want is the whole string
"This is some text about some issue I have, parsing xml"
But I only get an empty string. Any suggestions? Thanks!
This question has been asked many times. Note that element.text only holds the text that appears before an element's first child, which is why it comes back empty for your <a> tag.
You can use the lxml.html text_content() method:
import lxml.html
t = lxml.html.fromstring("...")
t.text_content()
REF: Filter out HTML tags and resolve entities in python
Or use the lxml.etree.strip_tags() method.
REF: In lxml, how do I remove a tag but retain all contents?
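Applied to the snippet from the question, a quick sketch looks like this:

import lxml.html

# Parse the fragment and pull out all of its text, tags stripped.
t = lxml.html.fromstring(
    "<a><b>This is</b> some text about <c>some</c> issue I have, parsing xml</a>")
print(t.text_content())
# This is some text about some issue I have, parsing xml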
I am brand new to Python, and I need some help with the syntax for finding and iterating through HTML tags using lxml. Here are the use cases I am dealing with:
HTML file is fairly well formed (but not perfect). Has multiple tables on screen, one containing a set of search results, and one each for a header and footer. Each result row contains a link for the search result detail.
I need to find the middle table with the search result rows (this one I was able to figure out):
self.mySearchTables = self.mySearchTree.findall(".//table")
self.myResultRows = self.mySearchTables[1].findall(".//tr")
I need to find the links contained in this table (this is where I'm getting stuck):
for searchRow in self.myResultRows:
    searchLink = patentRow.findall(".//a")
It doesn't seem to actually locate the link elements.
I need the plain text of the link. I imagine it would be something like searchLink.text if I actually got the link elements in the first place.
Finally, in the actual API reference for lxml, I wasn't able to find information on the find and findall calls. I gleaned these from bits of code I found on Google. Am I missing something about how to effectively find and iterate over HTML tags using lxml?
Okay, first, regarding parsing the HTML: if you follow the recommendation of zweiterlinde and S.Lott, at least use the version of BeautifulSoup included with lxml. That way you will also reap the benefit of a nice XPath or CSS selector interface.
However, I personally prefer Ian Bicking's HTML parser included in lxml.
Secondly, .find() and .findall() come from lxml trying to be compatible with ElementTree, and those two methods are described in XPath Support in ElementTree.
Those two methods are fairly easy to use, but they support only a very limited subset of XPath. I recommend using either the full lxml xpath() method or, if you are already familiar with CSS, the cssselect() method.
Here are some examples, with an HTML string parsed like this:
from lxml.html import fromstring
mySearchTree = fromstring(your_input_string)
Using the CSS selector method, your program would look roughly like this:

# Find all 'a' elements inside 'tr' table rows with a CSS selector
for a in mySearchTree.cssselect('tr a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))
The equivalent using the xpath() method would be:

# Find all 'a' elements inside 'tr' table rows with XPath
for a in mySearchTree.xpath('.//tr//a'):
    print('found "%s" link to href "%s"' % (a.text, a.get('href')))
Is there a reason you're not using Beautiful Soup for this project? It will make dealing with imperfectly formed documents much easier.
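If you do go that route, a rough BeautifulSoup equivalent of the examples above (assuming bs4 and the same your_input_string) would be:

from bs4 import BeautifulSoup

# Same link extraction as the cssselect()/xpath() examples, via bs4's select().
soup = BeautifulSoup(your_input_string, 'html.parser')
for a in soup.select('tr a'):
    print('found "%s" link to href "%s"' % (a.get_text(), a.get('href')))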