This question already has answers here:
Parsing HTML using Python
(7 answers)
Closed 3 years ago.
I have a string like this:
string = r'''<img height="233" src="monline/" title="email example" width="500" ..
title="second example title" width="600"...
title="one more title"...> '''
I am trying to get anything that appears as title (title="Anything here")
I have already tried this but it does not work correctly.
re.findall(r'title=\"(.*)\"',string)
I think your regex is too greedy. You can try something like this:
re.findall(r'title=\"(?P<title>[\w\s]+)\"', string)
As #Austin and #Plato77 said in the comments, there is a better way to parse HTML in Python. See the other SO answers for more context. There are a few common tools for this, like:
https://docs.python.org/3/library/html.parser.html
https://www.simplifiedpython.net/parsing-html-in-python/
https://github.com/psf/requests-html / Get html using Python requests?
If you would like to read more on performance testing of different Python HTML parsers, you can learn more here.
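For instance, BeautifulSoup can pull the title attributes out directly; a minimal sketch, with the input reduced to a cleaned-up version of the snippet from the question:
from bs4 import BeautifulSoup

html = '<img title="email example" width="500"><img title="second example title" width="600">'
soup = BeautifulSoup(html, 'html.parser')
# collect the title attribute of every tag that has one
titles = [tag['title'] for tag in soup.find_all(attrs={'title': True})]
print(titles)  # ['email example', 'second example title']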
As #Austin and #Plato77 said in the comments, there is a better way to parse HTML in Python. I stand by this too, but if you want to get it done through regex, this may help:
c = re.finditer(r'title=[\"]([a-zA-Z0-9\s]+)[\" ]', string)
for i in c:
    print(i.group(1))
The problem here is that the next " symbol is treated as an ordinary character and ends up inside the (.*) group of your RE; because .* is greedy, the match runs on to the last " it can find. For your use case, restricting the group to letters, numbers, and whitespace avoids this.
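Alternatively, a non-greedy quantifier (.*?) stops each match at the first closing quote; a small sketch on the sample string:
import re

string = r'''<img height="233" src="monline/" title="email example" width="500" ..
title="second example title" width="600"...
title="one more title"...> '''

# .*? is lazy, so each match ends at the first " that follows
print(re.findall(r'title="(.*?)"', string))
# ['email example', 'second example title', 'one more title']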
This question already has an answer here:
python regex first/shortest match
(1 answer)
Closed 6 months ago.
I am parsing some text with Python and am running into an odd issue...
an example text that is being parsed:
msg:"ET WEB_SPECIFIC_APPS ClarkConnect Linux proxy.php XSS Attempt"; flow:established,to_server; content:"GET"; content:"script"; nocase; content:"/proxy.php?"; nocase; content:"url="; nocase; pcre:"//proxy.php(?|.[\x26\x3B])url=[^&;\x0D\x0A][<>"']/i"; reference:url,www.securityfocus.com/bid/37446/info; reference:url,doc.emergingthreats.net/2010602; classtype:web-application-attack; sid:2010602; rev:4; metadata:created_at 2010_07_30, updated_at 2010_07_30;
my regex:
msgSearch = re.search(r'msg:"(.+)";', line)
actual result:
ET WEB_SPECIFIC_APPS ClarkConnect Linux proxy.php XSS Attempt"; flow:established,to_server; content:"GET"; content:"script"; nocase; content:"/proxy.php?"; nocase; content:"url="; nocase; pcre:"//proxy.php(?|.[\x26\x3B])url=[^&;\x0D\x0A][<>"']/i
expected result:
ET WEB_SPECIFIC_APPS ClarkConnect Linux proxy.php XSS Attempt
There are 10s of thousands of lines of text that I am parsing that are all giving me similar results. Any reason regex is picking a (seemingly) random "; to stop at? I can fix the example above by making the regex more specific, e.g. r'msg:"([\w\s\.]+)";', but other lines have different characters included. I guess I could just include every special character in my regex, but I'm trying to understand why my wildcard isn't working properly.
Any help would be appreciated!
Try this one:
re.search(r'msg:"([^;]+)";',line)
The .+ is by default "greedy", i.e. it will match as many characters as possible. In your case, it will stop at the last "; sequence, not at the next one. To make it non-greedy (or lazy), try .+? :
msgSearch = re.search(r'msg:"(.+?)";', line)
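A quick way to see the difference on a cut-down line (a small sketch, not from the original answer):
import re

line = 'msg:"first"; content:"GET"; sid:1;'

greedy = re.search(r'msg:"(.+)";', line)
lazy = re.search(r'msg:"(.+?)";', line)

print(greedy.group(1))  # first"; content:"GET   (runs to the last "; on the line)
print(lazy.group(1))    # first                  (stops at the first ";)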
This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 7 years ago.
I am not very familiar with regexes and would like somebody to put this into something that I will be able to understand, i.e. outline what each part of the regex is doing:
re.compile(r'ATG((?:[ACTG]{3})+?)(?:TAG|TAA|TGA)')
So far, this is what I have come up with:
re.compile is a regex method... or something along those lines
r' is simply needed in regex
After that, I'm not too sure...
Searches for a piece in the string ATG
(?:[ACTG]{3}) searches for a piece of the string containing the characters A C T G (does the order of these matter?) that is {3} three characters long.
+? something about going at least once, but minimal times...? What part of the code would be going at least once, but minimal times?
?: searches for TAG|TAA|TGA within the string. Once it finds these, what happens?
Would I be able to do something like
key_words = "TAG TAA TGA".replace(" ", "|") so that I can have a whole long list without having to type | a bunch of times if I have over 100 substrings?
I would then format this to something like this:
...(?:key_words)')
Examples and simple explanations always work wonders - thanks!
You can use regex101 to have it explained step by step.
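To make that step-by-step breakdown concrete, here is a small sketch (the DNA string is made up) that also shows building the stop-codon alternation from a list with '|'.join, rather than typing | by hand as you asked:
import re

# ATG             - the literal start codon
# (?:[ACTG]{3})+? - one or more three-letter codons, captured as group 1, as few as possible (the lazy +?)
# (?:TAG|TAA|TGA) - one of the stop codons, matched but not captured
pattern = re.compile(r'ATG((?:[ACTG]{3})+?)(?:TAG|TAA|TGA)')
print(pattern.findall('CCATGAAACCCGGGTAGTT'))  # ['AAACCCGGG']

# Building the alternation from a list instead of typing | repeatedly:
stop_codons = ['TAG', 'TAA', 'TGA']
dynamic = re.compile(r'ATG((?:[ACTG]{3})+?)(?:' + '|'.join(stop_codons) + r')')
print(dynamic.findall('CCATGAAACCCGGGTAGTT'))  # ['AAACCCGGG']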
This question already has answers here:
Reversing a regular expression in Python
(8 answers)
Closed 1 year ago.
I have some difficulties learning regex in Python. I want to turn my Tornado web route configuration, together with arguments, into a request path string without using the handler's request.path method.
For example, I have routes with patterns like:
/entities/([0-9]+)
/product/([0-9]+)/actions
The expected result, combined with the integer parameter (123), will be a string like:
/entities/123
/product/123/actions
How do I generate a string based on that pattern?
Thank you very much in advance!
This might be a possible duplicate of:
Reversing a regular expression in Python
Generate a String that matches a RegEx in Python
Using the answer provided by #bjmc, a solution works like this:
>>> import rstr
>>> intermediate = rstr.xeger(r'\d+')
>>> path = '/product/' + intermediate + '/actions'
Depending on how long you want your intermediate integer, you could adjust the regex, e.g. to \d{1,3}.
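If the goal is to fill in a known argument such as 123 rather than a random one, another option (not from the linked answers, just a sketch; build_path is a made-up helper name) is to substitute the capture groups in the route pattern directly:
import re

routes = ['/entities/([0-9]+)', '/product/([0-9]+)/actions']

def build_path(route_pattern, *args):
    # replace each parenthesised group in the pattern with the next argument
    args = iter(args)
    return re.sub(r'\([^)]*\)', lambda m: str(next(args)), route_pattern)

print(build_path(routes[0], 123))  # /entities/123
print(build_path(routes[1], 123))  # /product/123/actions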
This question already has answers here:
regular expression to extract text from HTML
(11 answers)
Closed 10 years ago.
How do I extract everything that is not an HTML tag from a partial HTML text?
That is, if I have something of the type:
<div>Hello</div><h3><div>world</div></h3>
I want to extract ['Hello','world']
I thought about the Regex:
>[a-zA-Z0-9]+<
but it will not include special characters or Chinese or Hebrew characters, which I need.
You should look at something like regular expression to extract text from HTML
From that post:
You can't really parse HTML with regular expressions. It's too
complex. RE's won't handle HTML that will work in
a browser as proper text, but might baffle a naive RE.
You'll be happier and more successful with a proper HTML parser.
Python folks often use something like Beautiful Soup to parse HTML and
strip out tags and scripts.
Also, browsers, by design, tolerate malformed HTML. So you will often
find yourself trying to parse HTML which is clearly improper, but
happens to work okay in a browser.
You might be able to parse bad HTML with RE's. All it requires is
patience and hard work. But it's often simpler to use someone else's
parser.
As Avi already pointed out, this is too complex a task for regular expressions. Use get_text from BeautifulSoup or clean_html from nltk to extract text from your html.
from bs4 import BeautifulSoup
clean_text = BeautifulSoup(html).get_text()
or
import nltk
clean_text = nltk.clean_html(html)
Another option, thanks to GuillaumeA, is to use pyquery:
from pyquery import PyQuery
clean_text = PyQuery(html).text()
It must be said that the above-mentioned HTML parsers will do the job with varying levels of success if the HTML is not well formed, so you should experiment and see what works best for your input data.
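Applied to the snippet from the question, the BeautifulSoup route looks roughly like this (a quick sketch):
from bs4 import BeautifulSoup

html = '<div>Hello</div><h3><div>world</div></h3>'
soup = BeautifulSoup(html, 'html.parser')
# get_text() joins all text nodes; stripped_strings keeps them separate
print(soup.get_text())              # Helloworld
print(list(soup.stripped_strings))  # ['Hello', 'world']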
I am not familiar with Python, but the following regular expression can help you.
<\s*(\w+)[^/>]*>
where,
<: starting character
\s*: it may have whitespace before the tag name (ugly but possible).
(\w+): tags can contain letters and numbers (h1). Well, \w also matches '_', but it does not hurt I guess. If curious use ([a-zA-Z0-9]+) instead.
[^/>]*: anything except > and / until closing >
>: the closing >
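This pattern matches the tags themselves; to get the text between them you could, for example, split on a variation of it that also matches closing tags (a rough sketch, with the usual caveat that a real parser handles messy HTML better):
import re

html = '<div>Hello</div><h3><div>world</div></h3>'

# split on anything that looks like an opening or closing tag, keep the non-empty pieces
parts = [p for p in re.split(r'</?\s*\w+[^>]*>', html) if p]
print(parts)  # ['Hello', 'world']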
This question already has answers here:
How to input a regex in string.replace?
(7 answers)
Closed 5 years ago.
I am trying to grab everything after the '</html>' tag and delete it, but my code doesn't seem to be doing anything. Does .replace() not support regex?
z.write(article.replace('</html>.+', '</html>'))
No. Regular expressions in Python are handled by the re module.
article = re.sub(r'(?is)</html>.+', '</html>', article)
In general:
str_output = re.sub(regex_search_term, regex_replacement, str_input)
In order to replace text using a regular expression, use the re.sub function:
sub(pattern, repl, string[, count, flags])
It will replace non-overlapping instances of pattern in the text passed as string. If you need to analyze the match to extract information about specific group captures, for instance, you can pass a function as the repl argument. More info here.
Examples
>>> import re
>>> re.sub(r'a', 'b', 'banana')
'bbnbnb'
>>> re.sub(r'/\d+', '/{id}', '/andre/23/abobora/43435')
'/andre/{id}/abobora/{id}'
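The function form of repl mentioned above works like this (a small sketch, not part of the original examples):
>>> re.sub(r'\d+', lambda m: str(int(m.group(0)) * 2), 'a1 b2 c3')
'a2 b4 c6'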
You can use the re module for regexes, but regexes are probably overkill for what you want. I might try something like
z.write(article[:article.index("</html>") + 7])
This is much cleaner, and should be much faster than a regex based solution.
For this particular case, if using the re module is overkill, how about using the split (or rsplit) method, as in:
se='</html>'
z.write(article.split(se)[0]+se)
For example,
#!/usr/bin/python
article='''<html>Larala
Ponta Monta
</html>Kurimon
Waff Moff
'''
z=open('out.txt','w')
se='</html>'
z.write(article.split(se)[0]+se)
outputs out.txt as
<html>Larala
Ponta Monta
</html>