This question already has an answer here:
python regex first/shortest match
(1 answer)
Closed 6 months ago.
I am parsing some text with Python and am running into an odd issue...
an example text that is being parsed:
msg:"ET WEB_SPECIFIC_APPS ClarkConnect Linux proxy.php XSS Attempt"; flow:established,to_server; content:"GET"; content:"script"; nocase; content:"/proxy.php?"; nocase; content:"url="; nocase; pcre:"//proxy.php(?|.[\x26\x3B])url=[^&;\x0D\x0A][<>"']/i"; reference:url,www.securityfocus.com/bid/37446/info; reference:url,doc.emergingthreats.net/2010602; classtype:web-application-attack; sid:2010602; rev:4; metadata:created_at 2010_07_30, updated_at 2010_07_30;
my regex:
msgSearch = re.search(r'msg:"(.+)";', line)
actual result:
ET WEB_SPECIFIC_APPS ClarkConnect Linux proxy.php XSS Attempt"; flow:established,to_server; content:"GET"; content:"script"; nocase; content:"/proxy.php?"; nocase; content:"url="; nocase; pcre:"//proxy.php(?|.[\x26\x3B])url=[^&;\x0D\x0A][<>"']/i
expected result:
ET WEB_SPECIFIC_APPS ClarkConnect Linux proxy.php XSS Attempt
I am parsing tens of thousands of lines of text that all give me similar results. Why does the regex pick a (seemingly) random "; to stop at? I can fix the example above by making the regex more specific, e.g. r'msg:"([\w\s\.]+)";' but other lines contain different characters. I guess I could just include every special character in my regex, but I'm trying to understand why my wildcard isn't working properly.
Any help would be appreciated!
Try this one:
re.search(r'msg:"([^;]+)";',line)
The .+ is by default "greedy", i.e. it will match as many characters as possible. In your case, it will stop at the last "; sequence, not at the next one. To make it non-greedy (or lazy), try .+? :
msgSearch = re.search(r'msg:"(.+?)";', line)
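To see the difference concretely, here is a minimal sketch on a shortened version of the rule above. The `line` value is abbreviated for illustration, and the negated-class variant assumes the message itself never contains a double quote.

```python
import re

# Shortened, illustrative version of the rule line from the question.
line = 'msg:"ET WEB_SPECIFIC_APPS XSS Attempt"; flow:established,to_server; content:"GET"; sid:2010602;'

# Greedy: .+ grabs as much as it can, then backtracks to the LAST ";
greedy = re.search(r'msg:"(.+)";', line).group(1)

# Lazy: .+? grows one character at a time and stops at the FIRST ";
lazy = re.search(r'msg:"(.+?)";', line).group(1)

# Negated class: cannot run past a quote (assumes no quotes inside msg).
negated = re.search(r'msg:"([^"]+)";', line).group(1)

print(greedy)
print(lazy)
print(negated)
```

The lazy and negated-class versions both return only the message text, while the greedy one runs on to the last "; in the line.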
I'm really sorry for asking, because there are already some questions like this around, but I can't adapt their answers to my problem.
These are the input lines (e.g. from a config file)
profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9
I just want to extract the values after the = sign using a regex in Python 3.9.
I tried this on regex101.
^profile[0-9]\.name=(.*?)
But this gives me the variable name including the = sign as the result, e.g. profile2.name=. I want exactly the opposite.
The expected results (what Python's re.findall() returns) are
['share2', 'share8', 'shareSSH', 'share9']
Try pattern profile\d+\.name=(.*), look at Regex 101 example
import re
re.findall(r'profile\d+\.name=(.*)', txt)
# output
['share2', 'share8', 'shareSSH', 'share9']
But this problem doesn't necessarily need regex, split should work absolutely fine:
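A minimal sketch of the split approach, assuming the input lines are held in a string named `txt` (a name chosen here for illustration) and that each value follows the first = on its line:

```python
# Sample input copied from the question; `txt` is an assumed variable name.
txt = """profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9"""

# Split each line once on the first '=' and keep the right-hand side.
values = [line.split("=", 1)[1] for line in txt.splitlines() if "=" in line]
print(values)
```

Passing 1 as the second argument to split keeps any further = characters inside the value.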
Try removing the ? quantifier. It makes your capture group match an empty string.
regex101
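The effect of the trailing ? can be checked with a short sketch. The two-line `txt` sample is abbreviated from the question, and re.MULTILINE is assumed so that ^ and $ apply to each line:

```python
import re

txt = "profile2.name=share2\nprofile8.name=share8"

# A trailing lazy group has nothing after it that forces it to grow,
# so it is satisfied with zero characters.
lazy = re.findall(r'^profile[0-9]\.name=(.*?)', txt, re.MULTILINE)

# Greedy: consumes the rest of each line ('.' does not match a newline).
greedy = re.findall(r'^profile[0-9]\.name=(.*)', txt, re.MULTILINE)

# Lazy but followed by '$': the anchor forces the group to expand.
anchored = re.findall(r'^profile[0-9]\.name=(.*?)$', txt, re.MULTILINE)

print(lazy, greedy, anchored)
```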
This question already has answers here:
Parsing HTML using Python
(7 answers)
Closed 3 years ago.
I have a string like this:
string = r'''<img height="233" src="monline/" title="email example" width="500" ..
title="second example title" width="600"...
title="one more title"...> '''
I am trying to get anything that appears as title (title="Anything here")
I have already tried this but it does not work correctly.
re.findall(r'title=\"(.*)\"',string)
I think your Regex is too Greedy. You can try something like this
re.findall(r'title=\"(?P<title>[\w\s]+)\"', string)
As #Austin and #Plato77 said in the comments, there is a better way to parse HTML in python. See other SO Answers for more context. There are a few common tools for this like:
https://docs.python.org/3/library/html.parser.html
https://www.simplifiedpython.net/parsing-html-in-python/
https://github.com/psf/requests-html / Get html using Python requests?
If you would like to read more on performance testing of different python HTML parsers you can learn more here
As #Austin and #Plato77 said in the comments, there is a better way to parse HTML in python. I stand by this too, but if you want to get it done through regex this may help
c = re.finditer(r'title=[\"]([a-zA-Z0-9\s]+)[\" ]', string)
for i in c:
    print(i.group(1))
The problem here is that the next " symbol is matched as an ordinary character and becomes part of the (.*) in your RE. For your use case, you can restrict the group to letters, digits, and whitespace.
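As a rough comparison, the sketch below runs three variants on the string from the question. The lazy and negated-class patterns are alternatives offered here, not taken from the answers, and for real HTML a proper parser remains the better tool.

```python
import re

string = r'''<img height="233" src="monline/" title="email example" width="500" ..
title="second example title" width="600"...
title="one more title"...> '''

# Greedy: on each line, (.*) runs to the last quote it can find.
greedy = re.findall(r'title="(.*)"', string)

# Lazy: stops at the first closing quote.
lazy = re.findall(r'title="(.*?)"', string)

# Negated class: the group can never contain a quote at all.
negated = re.findall(r'title="([^"]*)"', string)

print(greedy)
print(lazy)
print(negated)
```

Unlike a character whitelist such as [\w\s]+, the negated class also accepts titles that contain punctuation.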
This question already has answers here:
String concatenation without '+' operator
(6 answers)
Closed 4 years ago.
I read that anything between triple quotes inside print is treated literally, so I tried experimenting a little. Now I can't get the statement below to work. I searched the internet but could not find anything.
statement:
print("""Hello World's"s""""")
Output I am getting:
Hello World's"s
Expected output:
Hello World's"s""
print("""Hello World's"s""""") is seen as print("""Hello World's"s""" "") because when Python finds """ it immediately ends the string that was opened with a triple double-quote.
Try this:
>>> print("a"'b')
ab
So basically your '"""Hello World's"s"""""' is just <str1>Hello World's"s</str1><str2></str2> with str2 an empty string.
Triple-quoted strings are usually used for docstrings.
As #zimdero pointed out Triple-double quote v.s. Double quote
You can also read https://stackoverflow.com/a/19479874/1768843
And https://www.python.org/dev/peps/pep-0257/
If you really want that result, just use \", or combine strings with ``, .format(), etc.
print("Hello World's\"s\"\"")
https://repl.it/repls/ThatQuarrelsomeSupercollider
Triple quotes within a triple-quoted string must still be escaped for the same reason a single quote within a single quoted string must be escaped: The string parsing ends as soon as python sees it. As mentioned, once tokenized your string is equivalent to
"""Hello World's"s""" ""
That is, two strings which are then concatenated by the compiler. Triple quoted strings can include newlines. Your example is similar to
duke = """Thou seest we are not all alone unhappy:
This wide and universal theatre
Presents more woeful pageants than the scene
Wherein we play in."""
jaques = """All the world's a stage,
And all the men and women merely players:
They have their exits and their entrances;
And one man in his time plays many parts."""
If python was looking for the outermost triple quotes it would only have defined one string here.
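A short sketch of both points, using the statement from the question. The variable names are chosen here for illustration only.

```python
# The tokenizer closes a triple-quoted string at the first """ it meets,
# so the statement from the question is really two adjacent literals.
original = """Hello World's"s"""""
explicit = """Hello World's"s""" ""

# Escaping the trailing quotes keeps them inside a single literal.
escaped = """Hello World's"s\"\""""

print(original == explicit)
print(original)
print(escaped)
```

The first two literals compare equal because the trailing "" is an empty string that gets concatenated onto the first one.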
Simple with ''' to not complicate things:
print('''Hello World's"s""''')
Maybe this is what you are looking for?
print("\"\"Hello World's's\"\"")
Output:
""Hello World's's""
This question already has an answer here:
Reference - What does this regex mean?
(1 answer)
Closed 7 years ago.
I am not very familiar with regex(s) and would like somebody to put this into something that I will be able to understand? As in, outline what each part of the regex is doing
re.compile(r'ATG((?:[ACTG]{3})+?)(?:TAG|TAA|TGA)')
So far, this is what I have come up with:
re.compile is a regex method... or something along those lines
r' is simply needed in regex
After that, I'm not too sure...
Searches for a piece in the string ATG
?:[ACTG]{3} searches for a piece of the string containing the characters A C T G within the string (does the order of these matter?) that is {3} three characters long.
+? something about matching at least once, but as few times as possible...? Which part of the pattern would be matched at least once, but a minimal number of times?
?: searches for TAG|TAA|TGA within the string. Once it finds one of these, what happens?
Would I be able to do something like
key_words = "TAG TAA TGA".replace(" ", "|") so that I can have a whole long list without having to type | a bunch of times if I have over 100 substrings?
I would then format this to something like this:
...(?:key_words)')
Examples and simple explanations always work wonders - thanks!
You can use regex101 to have it explained step by step.
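For a quick offline breakdown, the same pattern can be written with re.VERBOSE so each part carries a comment. The sketch also builds the alternation from a list, as the question suggests. The sample sequence and variable names are made up for illustration.

```python
import re

verbose_pattern = re.compile(r"""
    ATG                 # literal start codon
    (                   # capture group 1: the coding region
      (?:[ACTG]{3})+?   # one or more codons of 3 bases, as few as possible
    )
    (?:TAG|TAA|TGA)     # non-capturing group: any one of the stop codons
""", re.VERBOSE)

# Building the alternation from a list instead of typing every '|'.
stop_codons = ["TAG", "TAA", "TGA"]
alternation = "|".join(re.escape(codon) for codon in stop_codons)
built_pattern = re.compile(r"ATG((?:[ACTG]{3})+?)(?:" + alternation + r")")

sample = "CCATGAAACCCTAGGG"  # hypothetical sequence
match = verbose_pattern.search(sample)
print(match.group(0))  # whole match, from ATG through the stop codon
print(match.group(1))  # only the codons between start and stop
```

The (?: ... ) groups only bundle alternatives or repetitions together, so group(1) is the single captured value. The lazy +? makes the match end at the first in-frame stop codon rather than the last one.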
I know from this question that the "nothing to repeat" error in a regular expression is a known Python bug.
But I must compile this Unicode expression
re.compile(u'\U0000002A \U000020E3')
as a single character. This is an emoticon and counts as one character. Python interprets this string as u'* \\u20e3' and raises a 'nothing to repeat' error.
I have been looking around but can't find a solution. Is there a workaround?
This has little to do with the question you linked. You're not running into a bug. Your regex simply has a special character (a *) that you haven't escaped.
Simply escape the string before compiling it into a regex:
re.compile(re.escape(u'\U0000002A \U000020E3'))
Now, I'm a little unsure as to why you're representing * as \U0000002A — perhaps you could clarify what your intent is here?
You need to use re.escape (as shown in Thomas Orozco's answer)
But use it only on the part that is dynamic such as:
print(re.findall(u"cool\\s*%s" % re.escape(u'\U0000002A \U000020E3'),
                 u"cool * \U000020E3 crazy"))
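The sketch below reproduces the error and the fix side by side in Python 3. The `keycap` name and the sample text are illustrative assumptions.

```python
import re

# '*', a space, then U+20E3, exactly as written in the question.
keycap = '\U0000002A \U000020E3'

# Unescaped, the leading '*' is a quantifier with nothing before it.
try:
    re.compile(keycap)
    raised = False
except re.error:
    raised = True

# Escaped, every character is matched literally.
escaped = re.compile(re.escape(keycap))
found = escaped.findall('cool * \U000020E3 crazy')

print(raised)
print(found == [keycap])
```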