I've got a string that looks like this:
<span class=\"market_listing_price market_listing_price_with_fee\">\r
\t\t\t\t\t$92.53 USD\t\t\t\t<\/span>
I need to find this string via RegEx. My attempt:
(^<span class=\\"market_listing_price market_listing_price_with_fee\\">\\r\\t\\t\\t\\t\\t&)
But my problem is that the number of "\t" and "\r" characters may vary. And of course this is not the regular expression for the whole string, only for a part of it.
So, what's the correct and full RegEx for this string?
Answering your question about the Regex:
"market_listing_price market_listing_price_with_fee\\">[\\r]*[\\t]*&
This will catch the string you need, even if you add more \t's or \r's.
If you need to edit this regex, I advise you to test and modify it on a regex-testing website. That will also help you understand how regular expressions work and build your own complete regex.
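For instance, here is a minimal sketch in Python using the string from the question; swapping the explicit [\r]*[\t]* for \s* absorbs any run of \r and \t characters, so the count no longer matters (the capture group for the price is my addition, not part of the answer above):

```python
import re

# sample string from the question, with escapes resolved
html = '<span class="market_listing_price market_listing_price_with_fee">\r\t\t\t\t\t$92.53 USD\t\t\t\t</span>'

# \s* absorbs any number of \r, \t, and spaces before the price
pattern = r'market_listing_price_with_fee">\s*(\$[\d.]+ USD)'
match = re.search(pattern, html)
print(match.group(1))  # $92.53 USD
```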
Since this is an HTML string, I would suggest using an HTML Parser like BeautifulSoup.
Here is an example approach finding the element by class attribute value using a CSS selector:
from bs4 import BeautifulSoup
data = "my HTML data"
soup = BeautifulSoup(data, "html.parser")  # name a parser explicitly to avoid the "no parser specified" warning
result = soup.select("span.market_listing_price.market_listing_price_with_fee")
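A runnable sketch of that approach, with a small made-up snippet standing in for the real page data (the get_text(strip=True) call is my addition to pull out the clean price):

```python
from bs4 import BeautifulSoup

# made-up stand-in for the real page data
data = '<span class="market_listing_price market_listing_price_with_fee">\r\t$92.53 USD\t</span>'
soup = BeautifulSoup(data, "html.parser")
result = soup.select("span.market_listing_price.market_listing_price_with_fee")
price = result[0].get_text(strip=True)
print(price)  # $92.53 USD
```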
See also:
RegEx match open tags except XHTML self-contained tags
I have the following code that checks whether there is form content in an email body, but I did not understand what the string '<\s?\/?\s?form\s?>' means, and whether there is another method to check for the existence of form content in an email.
This is the code I wrote:
class HTMLFormFinder(FeatureFinder):
    def getFeature(self, message):
        import re
        super(HTMLFormFinder, self).getFeature(message)
        payload = utils.getpayload(message).lower()
        return re.compile(r'<\s?\/?\s?form\s?>', re.IGNORECASE).search(payload) != None
Thanks in advance.
It's what's called a regular expression. It's a way to match strings that follow a particular pattern.
https://docs.python.org/3.7/library/re.html
Here r'<\s?\/?\s?form\s?>' describes a <form> HTML tag with some tolerance for bad/malformed HTML; specifically, each \s? allows a single optional whitespace character beside the tag name form.
A better way of checking the presence of forms is to use an XML/HTML parser, like ElementTree, BeautifulSoup, because they handle bad/incorrect HTML much better than regular expressions ever can. But if you want to keep it simple, the regex you have should suffice.
https://docs.python.org/3.7/library/xml.etree.elementtree.html
https://www.crummy.com/software/BeautifulSoup/bs4/doc/
Using BeautifulSoup you can do:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, "html.parser")
has_form = soup.find('form') is not None  # find() returns None when no <form> tag is present
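A self-contained run of that check, on a hypothetical email payload (the sample HTML is made up):

```python
from bs4 import BeautifulSoup

# hypothetical email payload containing a form
payload = '<html><body><form action="/login"><input name="user"/></form></body></html>'
soup = BeautifulSoup(payload, "html.parser")
has_form = soup.find('form') is not None
print(has_form)  # True
```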
You can read more on regular expressions here:
https://docs.python.org/2/library/re.html
Specifically, \s matches any whitespace character.
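To see what the pattern tolerates, here is a small demonstration (the sample strings are made up):

```python
import re

form_re = re.compile(r'<\s?\/?\s?form\s?>', re.IGNORECASE)

# each \s? allows at most one optional whitespace character
samples = ['<form>', '</form>', '< form >', '<FORM>', '<div>']
results = [bool(form_re.search(s)) for s in samples]
print(results)  # [True, True, True, True, False]
```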
I'm trying to match HTML text mixed with some normal strings. I've already done most of the job, but the problem is with the string inside the HTML characters.
So the text I'm trying to find looks like this:
>(\n(optional))</td>\n<td style="text-align:right">Text i want</td>\n
So the main problem is the optional part, because it has \n, (), and a string, and all of them are optional.
What I've done so far is:
reg_num = r'></td>\\n<td style="text-align:right">.*?</td>\\n'
reg_num1 = r'(?<="\>).*?(?=\</)'
pattern = re.compile(reg_name)
pattern1 = re.compile(reg_num)
pattern2 = re.compile(reg_num1)
pup = re.findall(pattern1, str(html_text))
new_pup = re.findall(pattern2,str(pup))
What I did above is first find the text, and then find the text I want within it.
This code works fine for all the results which don't have the optional text.
What should I add in order to get matches when there is optional text too?
Is there a better way to find the text in one line, without dividing the work?
You should not use a regex to parse HTML; you should use a tool like XPath queries or CSS/jQuery-style selectors.
A package that allows you to parse HTML is BeautifulSoup. For example:
from bs4 import BeautifulSoup
soup = BeautifulSoup(str(html_text))
for td_tag in soup.find_all('td', {'style': 'text-align:right'}):
    print(td_tag.text)  # or do something else with the text
Here you parse it into a soup object, and then you iterate over all <td> tags that have a style attribute that is exactly "text-align:right". Now for all these td_tags, you print the .text (evidently you can do something else with it).
If you for instance want to construct a list of all these texts, you can use list comprehension:
from bs4 import BeautifulSoup
soup = BeautifulSoup(str(html_text))
all_texts = [td_tag.text for td_tag in soup.find_all('td',{'style':'text-align:right'})]
As you can see, here you specify what you want to extract, there is no need to write complex regexes that can easily fail or even are impossible to construct. One can easily read what you aim to extract.
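To make the example self-contained, here is the same idea run on a small made-up snippet standing in for html_text:

```python
from bs4 import BeautifulSoup

# made-up snippet standing in for html_text
html_text = '<tr><td></td>\n<td style="text-align:right">Text i want</td>\n</tr>'
soup = BeautifulSoup(html_text, "html.parser")
all_texts = [td.text for td in soup.find_all('td', {'style': 'text-align:right'})]
print(all_texts)  # ['Text i want']
```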
I would recommend using the beautifulsoup package for Python.
This question already has answers here:
regular expression to extract text from HTML
(11 answers)
Closed 10 years ago.
How do I extract everything that is not an HTML tag from a partial HTML text?
That is, if I have something of the type:
<div>Hello</div><h3><div>world</div></h3>
I want to extract ['Hello','world']
I thought about the Regex:
>[a-zA-Z0-9]+<
but it will not match special characters, or Chinese or Hebrew characters, which I need
You should look at something like regular expression to extract text from HTML
From that post:
You can't really parse HTML with regular expressions. It's too
complex. RE's won't handle markup that will work in
a browser as proper text, but might baffle a naive RE.
You'll be happier and more successful with a proper HTML parser.
Python folks often use something like Beautiful Soup to parse HTML and
strip out tags and scripts.
Also, browsers, by design, tolerate malformed HTML. So you will often
find yourself trying to parse HTML which is clearly improper, but
happens to work okay in a browser.
You might be able to parse bad HTML with RE's. All it requires is
patience and hard work. But it's often simpler to use someone else's
parser.
As Avi already pointed out, this is too complex a task for regular expressions. Use get_text from BeautifulSoup or clean_html from nltk to extract text from your HTML.
from bs4 import BeautifulSoup
clean_text = BeautifulSoup(html).get_text()
or
import nltk
clean_text = nltk.clean_html(html)
Another option, thanks to GuillaumeA, is to use pyquery:
from pyquery import PyQuery
clean_text = PyQuery(html).text()
It must be said that the above-mentioned HTML parsers will do the job with varying levels of success if the HTML is not well formed, so you should experiment and see what works best for your input data.
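Applied to the snippet from the question, the BeautifulSoup route gives exactly the desired list; the stripped_strings generator is one way to split the text into separate pieces:

```python
from bs4 import BeautifulSoup

html = '<div>Hello</div><h3><div>world</div></h3>'
pieces = list(BeautifulSoup(html, "html.parser").stripped_strings)
print(pieces)  # ['Hello', 'world']
```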
I am not familiar with Python, but the following regular expression can help you.
<\s*(\w+)[^/>]*>
where:
<: the opening <
\s*: there may be whitespace before the tag name (ugly, but possible)
(\w+): tag names can contain letters and numbers (h1). \w also matches '_', but that shouldn't hurt; if you'd rather be strict, use ([a-zA-Z0-9]+) instead
[^/>]*: anything except / and >, up to the closing >
>: the closing >
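As a quick check on the snippet from the question (note that this regex finds the opening tag names, not the text between tags):

```python
import re

html = '<div>Hello</div><h3><div>world</div></h3>'
tags = re.findall(r'<\s*(\w+)[^/>]*>', html)
print(tags)  # ['div', 'h3', 'div']
```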
I've seen other questions which will parse either all plain links, or all anchor tags from a string, but nothing that does both.
Ideally, the regular expression will be able to parse a string like this (I'm using Python):
>>> import re
>>> content = '''
... http://www.google.com Some other text.
... And even more text! http://stackoverflow.com
... '''
>>> links = re.findall('some-regular-expression', content)
>>> print links
[u'http://www.google.com', u'http://stackoverflow.com']
Is it possible to produce a regular expression which would not result in duplicate links being returned? Is there a better way to do this?
No matter what you do, it's going to be messy. Nevertheless, a 90% solution might resemble:
r'<a\s[^>]*>([^<]*)</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'
Since that pattern has two groups, it will return a list of 2-tuples; to join them, you could use a list comprehension or even a map:
map(''.join, re.findall(pattern, content))
If you want the href attribute of the anchor instead of the link text, the pattern gets even messier:
r'<a\s[^>]*href=[\'"]([^"\']*)[\'"][^>]*>[^<]*</a>|\b(\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()])'
Alternatively, you can just let the second half of the pattern pick up the URL, which also alleviates the need for the string join:
r'\b\w+://[^<>\'"\t\r\n\xc2\xa0]*[^<>\'"\t\r\n\xc2\xa0 .,()]'
Once you have this much in place, you can replace any found links with something that doesn't look like a link, search for '://', and update the pattern to collect what it missed. You may also have to clean up false positives, particularly garbage at the end. (This pattern had to find links that included spaces, in plain text, so it's particularly prone to excess greediness.)
Warning: Do not rely on this for future user input, particularly when security is on the line. It is best used only for manually collecting links from existing data.
Usually you should never parse HTML with regular expressions, since HTML isn't a regular language. Here it seems you only want to get all the http links, whether they are in an <a> element or in plain text. How about getting them all and then removing the duplicates?
Try something like
set(re.findall("(http:\/\/.*?)[\"' <]", content))
and see if it serves your purpose.
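For example (the sample string is made up; note the trailing space, which the terminating character class ["' <] needs in order to match the last link):

```python
import re

content = "see http://www.google.com and http://stackoverflow.com and http://www.google.com "
links = set(re.findall("(http:\/\/.*?)[\"' <]", content))
print(links)  # two unique links
```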
Writing a regex pattern that matches every valid URL is tricky business.
If all you're looking for is to detect simple http/https URLs within an arbitrary string, I could offer you this solution:
>>> import re
>>> content = 'http://www.google.com Some other text. And even more text! http://stackoverflow.com'
>>> re.findall(r"https?://[\w\-.~/?:#\[\]#!$&'()*+,;=]+", content)
['http://www.google.com', 'http://www.google.com', 'http://stackoverflow.com']
That looks for strings that start with http:// or https:// followed by one or more valid chars.
To avoid duplicate entries, use set():
>>> list(set(re.findall(r"https?://[\w\-.~/?:#\[\]#!$&'()*+,;=]+", content)))
['http://www.google.com', 'http://stackoverflow.com']
You should not use regular expressions to extract things from HTML. You should use an HTML parser.
If you also want to extract things from the text of the page then you should do that separately.
Here's how you would do it with lxml:
# -*- coding: utf8 -*-
import lxml.html as lh
import re
html = """<a href="http://is.gd/test">is.gd/test</a>http://www.google.com Some other text.
And even more text! http://stackoverflow.com
here's a url <a href="http://bit.ly/test">bit.ly/test</a>
"""
tree = lh.fromstring(html)
urls = set([])
for a in tree.xpath('//a'):
    urls.add(a.text)
for text in tree.xpath('//text()'):
    for url in re.findall(r'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?«»“”‘’]))', text):
        urls.add(url[0])
print urls
Result:
set(['http://www.google.com', 'bit.ly/test', 'http://stackoverflow.com', 'is.gd/test'])
URL matching regex from here: http://daringfireball.net/2010/07/improved_regex_for_matching_urls
No, a regexp will not be able to parse a string like this. Regexps are capable of simple matching, and you can't handle parsing a grammar as complicated as HTML with just one or two regexps.
How to get a value of nested <b> HTML tag in Python using regular expressions?
<b>LG</b> X110
# => LG X110
You don't.
Regular Expressions are not well suited to deal with the nested structure of HTML. Use an HTML parser instead.
Don't use regular expressions for parsing HTML. Use an HTML parser like BeautifulSoup. Just look how easy it is:
from BeautifulSoup import BeautifulSoup
html = r'<b>LG</b> X110'
soup = BeautifulSoup(html)
print ''.join(soup.findAll(text=True))
# LG X110
Your question was very hard to understand, but from the given output example it looks like you want to strip everything within < and > from the input text. That can be done like so:
import re
input_text = '<a bob>i <b>c</b></a>'
output_text = re.sub('<[^>]*>', '', input_text)
print output_text
Which gives you:
i c
If that is not what you want, please clarify.
Please note that the regular expression approach for parsing XML is very brittle. For instance, the above example would break on the input <a name="b>c">hey</a>. (> is a valid character in an attribute value: see the XML spec.)
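To demonstrate the breakage, here is that hypothetical input run through the same substitution:

```python
import re

input_text = '<a name="b>c">hey</a>'
output_text = re.sub('<[^>]*>', '', input_text)
print(output_text)  # c">hey  -- part of the attribute value leaks into the output
```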
Try this...
<a.*<b>(.*)</b>(.*)</a>
$1 and $2 should be what you want, or whatever means Python has for printing captured groups.
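In Python the captured groups come out of the match object. Here is a sketch, assuming the fragment is wrapped in an <a> tag as this pattern requires (the sample anchor is made up; note that group 2 keeps the leading space):

```python
import re

# made-up input: the pattern only matches if an <a>...</a> wrapper is present
html = '<a href="#"><b>LG</b> X110</a>'
m = re.search(r'<a.*<b>(.*)</b>(.*)</a>', html)
print(m.group(1))  # LG
print(m.group(2))  # ' X110' (with a leading space)
```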
+1 for Jens's answer. lxml is a good library you can use to actually parse this in a robust fashion. If you'd prefer something in the standard library, you can use xml.sax, xml.dom, or xml.etree.ElementTree.