how to extract values inside quotes using regex python 3.x - python

I have the following:
\"url\": \"\\\/maru.php?superalox=1\", \"params\": {\"params\": \"EgtmbUtXdUZWdDF2SSoCCABQAQ%3D%3D\", \"session_value\": \"QUFFLUhqbnQ4eW5HeEZYdDBULU5EVk1LREU2VndMMm1nd3xBQ3Jtc0trMUlMOWRqTWxpS0pOT2pNUVN6RENVU3k0Tmc4blplodexsWkxrVDRmOUN2Q0lXVkl1N0YwUFhoV1puQ3ZFQm10X1RzNWR4Q3RUeG5kMkdLNnNobTUyRkNuaG90d2c=\"}, \"log_params\":
I want to extract the value of params, which is EgtmbUtXdUZWdDF2SSoCCABQAQ%3D%3D.
I have tried this, but it didn't work:
my_text = """ \"url\": \"\\\/maru.php?superalox=1\", \"params\": {\"params\": \"EgtmbUtXdUZWdDF2SSoCCABQAQ%3D%3D\", \"session_value\": \"QUFFLUhqbnQ4eW5HeEZYdDBULU5EVk1LREU2VndMMm1nd3xBQ3Jtc0trMUlMOWRqTWxpS0pOT2pNUVN6RENVU3k0Tmc4blplodexsWkxrVDRmOUN2Q0lXVkl1N0YwUFhoV1puQ3ZFQm10X1RzNWR4Q3RUeG5kMkdLNnNobTUyRkNuaG90d2c=\"}, \"log_params\": """
extract_data = re.search(r'(\\\"params\": \\\")(\w*)', my_text)
print(extract_data)
Thanks

You can use:
re.search(r'"params": "([^"]+)"', my_text).group(1)
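For example, applied to a trimmed copy of the sample text:
import re

# Trimmed copy of the sample text from the question
my_text = '"params": {"params": "EgtmbUtXdUZWdDF2SSoCCABQAQ%3D%3D", "session_value": "QUFFLUhq..."}'

print(re.search(r'"params": "([^"]+)"', my_text).group(1))
# EgtmbUtXdUZWdDF2SSoCCABQAQ%3D%3D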

I've been using the following with great success:
(["'])(?:(?=(\\?))\2.)*?\1
It supports nested quotes as well.
This also works: (["'])(\\?.)*?\1 and is easier to read.
For those who want a deeper explanation of how this works, here's an explanation from user ephemient:
([""']) match a quote; ((?=(\\?))\2.) if backslash exists, gobble it, and whether or not that happens, match a character; *? match many times (non-greedily, as to not eat the closing quote); \1 match the same quote that was use for opening.
This is copied from another answer found here: RegEx: Grabbing values between quotation marks
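As a rough sketch of how that pattern behaves in Python (note that findall would return the capture groups, so finditer with group(0) is used here to get the full quoted substrings):
import re

quoted = re.compile(r'''(["'])(?:(?=(\\?))\2.)*?\1''')
text = r'"url": "\/maru.php?superalox=1", "params": "Egt \"escaped\" value"'

for m in quoted.finditer(text):
    print(m.group(0))
# "url"
# "\/maru.php?superalox=1"
# "params"
# "Egt \"escaped\" value"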

Related

Extract values in name=value lines with regex

I'm really sorry for asking because there are some questions like this around, but I can't adapt their answers to my problem.
These are the input lines (e.g. from a config file):
profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9
I just want to extract the values behind the = sign with a Python 3.9 regex.
I tried this on regex101.
^profile[0-9]\.name=(.*?)
But this gives me the variable name including the = sign as the result, e.g. profile2.name=. I want exactly the opposite.
The expected result (what Python's re.findall() should return) is
['share2', 'share8', 'shareSSH', 'share9']
Try the pattern profile\d+\.name=(.*); see the Regex 101 example.
import re
# txt holds the config lines shown above
re.findall(r'profile\d+\.name=(.*)', txt)
# output
['share2', 'share8', 'shareSSH', 'share9']
But this problem doesn't necessarily need regex, split should work absolutely fine:
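For instance, a small non-regex sketch (assuming the lines come from the config file shown above):
lines = """profile2.name=share2
profile8.name=share8
profile4.name=shareSSH
profile9.name=share9""".splitlines()

values = [line.split('=', 1)[1] for line in lines if '.name=' in line]
print(values)
# ['share2', 'share8', 'shareSSH', 'share9']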
Try removing the ? quantifier. It will make your capture group match an empty string.
regex101

Python - Injecting html tags into strings based on regex match

I wrote a script in Python for a custom HTML page that finds a word within a string/line and highlights just that word using the following tags, where instance is the word being searched for.
<b><font color=\"red\">"+instance+"</font></b>
With the following result:
I need to find a word (case insensitive), let's say "port", within a string where it can appear as port, Port, SUPPORT, Support, support etc., which is easy enough.
pattern = re.compile(word, re.IGNORECASE)
find_all_instances = pattern.findall(string_to_search)
However, my strings often contain two or more instances in a single line, and I need to append
<b><font color=\"red\">"+instance+"</font></b> to each of those instances, without changing their case.
The problem with my approach is that I am attempting to iterate over each of the instances found with findall (exact match),
while multiple identical matches can also be found within the string.
for instance in find_all_instances:
    second_pattern = re.compile(instance)
    string_to_search = second_pattern.sub("<b><font color=\"red\">"+instance+"</font></b>", string_to_search)
This results in the following:
<b><font color="red"><b><font color="red"><b><font color="red">Http</font></b></font></b></font></b></font>
when I need
<b><font color="red">Http</font></b>
I was thinking I could avoid this if I could find out the exact part of the string that pattern.sub substitutes at the moment of doing it,
however I was not able to find any examples of that kind of usage, which leads me to believe that I am doing something very wrong.
If anyone has a way I could use to insert <b><font color="red">instance</font></b> without replacing instance for all matches (case insensitive), I would be grateful.
Maybe I'm misinterpreting your question, but wouldn't re.sub be the best option?
Example: https://repl.it/DExs
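For what it's worth, here is a minimal sketch of that idea (not necessarily what the linked example shows): a single re.sub pass with re.IGNORECASE, substituting a backreference to the whole match (\g<0>) so each hit keeps its original casing and nothing gets wrapped twice:
import re

word = "port"
string_to_search = "Port 8080 and port 8081 support HTTP."

pattern = re.compile(re.escape(word), re.IGNORECASE)
result = pattern.sub(r'<b><font color="red">\g<0></font></b>', string_to_search)
print(result)
# Port, port, and the "port" inside "support" each get wrapped while keeping their original casing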
Okay, so here are two ways I did quickly! The second loop is definitely the way to go. It uses re.sub (as someone else commented too). Bear in mind that it replaces matches with the lowercase search term.
import re
FILE = open("testing.txt", "r")
word = "port"
# THIS LOOP IS CASE SENSITIVE
for line in FILE:
    newline = line.replace(word, "<b><font color=\"red\">" + word + "</font></b>")
    print newline
# THIS LOOP IS CASE INSENSITIVE
FILE.seek(0)  # rewind, since the first loop consumed the file
for line in FILE:
    pattern = re.compile(word, re.IGNORECASE)
    newline = pattern.sub("<b><font color=\"red\">" + word + "</font></b>", line)
    print newline

using \b in regex

--SOLVED--
I solved my issue by enabling multiline mode, and now the characters ^ and $ work perfectly for identifying the beginning and end of each string
--EDIT--
My code:
import re
import test_regex

def regex_content(text_content, regex_dictionary):
    #text_content = text_content.lower()
    regex_matches = []
    # Search sanitized text (markup removed) for DLP theme keywords
    for key, value in regex_dictionary.items():
        # Get configuration settings
        min_matches = value.get('min_matches', 1)
        risk = value.get('risk', 1)
        enabled = value.get('enabled', False)
        regex_str = value.get('regex', '')
        # Fast compute True/False hit for each DLP theme word
        if enabled:
            print "Searching for key : %s" % (key)
            my_regex = re.compile(value.get('regex'))
            hits = my_regex.findall(text_content)
            if len(hits) > 0:
                regex_matches.append((key, risk, len(hits), hits))
    # Return array of results (key, risk, number of hits, regex matches)
    return regex_matches

def main():
    #print defaults.test_regex.dlp_regex
    text_content = ""
    for line in open('testData.txt'):
        text_content += line
    for match in regex_content(text_content, test_regex.dlp_regex):
        print "\nFound %s : %s" % (match[0], match[3])
        print "\n"

if __name__ == '__main__':
    main()
and it is using the regex found here:
'Large number of US Zip Codes' : { 'regex' : "\b\d{5}(?:-\d{1,4})?\b"},
When I precede my regex with the r prefix, I can find the zip codes I'm looking for, but also every other 5-digit number in the document I am searching through. From my understanding, this is because it ignored the \b characters. Without the r prefix, though, it cannot find any zip codes. It works perfectly fine in regexr, but not in my code. I haven't had any luck making the \b characters work, nor ^ and $ for identifying the beginnings and ends of the strings I'm searching for. What is it that I am misunderstanding about these special characters?
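The r prefix point is worth making concrete: in a regular (non-raw) string literal, \b is the backspace control character rather than a regex word boundary, so the compiled pattern can never match. A minimal sketch:
import re

text = "Zip: 98839-0111 and 34332, ref 1234567"

print(repr("\b\\d{5}\b"))
# '\x08\\d{5}\x08' -- the \b escapes became backspace characters
print(re.findall("\b\\d{5}\b", text))
# [] -- the pattern is looking for literal backspaces
print(re.findall(r"\b\d{5}(?:-\d{1,4})?\b", text))
# ['98839-0111', '34332']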
--Original post--
I am writing a regex for identifying zip codes (and only zip codes), so to avoid false positives I am trying to include a boundary on my regex, using both of the following:
\b\d{5}\b|\b\d{5}-\b\d{1,4}\b
using the online regex debugger Regexr, my code should correctly catch 5 digit zip codes, such as 34332. However, I have two problems:
1. This regex is not working in my actual code for finding any zip codes, but it does work when I don't have the boundary (\b) characters. The exact code I'm trying to extract with my regex is:
Zip:
----
98839-0111
34332
2. I don't see why my regex can't correctly identify 98839-0111 in Regexr. I tried doing the super-primitive approach of
\b\d{5}\b|98839-0111
and even that couldn't identify 98839-0111. Does anyone know what could be going on?
Note: I have also tried using ^ and $ for the boundaries of my regex, but this also doesn't find the matches, not even in Regexr.
EDIT: After removing the first part of my regex, leaving only
98839-0111
It can now correctly identify it. I guess this means that once a string is pulled out by one of my regexes, it can no longer be found by any subsequent regexes? Why is this?
It is because of the alternation: the first alternative matched, and the engine stopped checking the rest.
Try this regex
98839-0111|\b\d{5}\b
And you'll get a match.
Or, to be more generic in your case:
\b(?:\d{5}-\d{4}|\d{5})\b
will match both, and more (actually, functionally the same as \b\d{5}(?:-\d{4})?\b). See demo.
Your pattern is evaluated for each position in the string from the left to the right, so if the left branch of your pattern succeeds, the second branch isn't tested at all.
I suggest you use this pattern, which solves the problem:
\b\d{5}(?:-\d{1,4})?\b
You can use this regex:
\b(\d{5}-\d{1,4}|\d{5})\b
Working demo
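As a quick check of the suggested pattern against the sample data:
import re

text = """Zip:
----
98839-0111
34332"""

print(re.findall(r"\b\d{5}(?:-\d{1,4})?\b", text))
# ['98839-0111', '34332']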

printing out optional regex

I am trying to print out values stored in a dictionary. These values were created from regular expressions.
Currently I have some optional fields but I am not sure if I am doing this correctly
(field A(field B)? field C (field D)?)?
I read a quick reference and it said that ? means 0 or 1 occurrence.
When I try to search for a field such as reputation or content-type, I get None because these are optional in my regex. I might have the wrong regular expression, but I am wondering why, whenever I search for an optional field (...)?, it prints out None.
my code:
import re
httpproxy515139 = re.compile(r'....url\=\"(?P<url>(.*))\"(\s+exceptions\=\"(?P<exceptions>([^\"]*))\"\s+error\=\"(?P<error>([^\"]*))\"\s+(reputation\=\"(?P<reputation_opt>([^\"]*))\"\s+)?category\=\"(?P<category>([^\"]*))\"\s+reputation\=\"(?P<reputation>([^\"]*))\"\s+categoryname\=\"(?P<categoryname>([^\"]*))\"\s+(content-type\=\"(?P<content_type>([^\"]*))\")?)?')
f = open("sophos-httpproxy.out", "r")
fw = open("sophosfilter.log", "w+")
HttpProxyCount = 0
otherCount = 0
for line in f.readlines():
    HttpProxy = re.search(httpproxy515139, line)
    HttpProxy.groupdict()
    print "AV Field: "
    print "Date/Time: " + str(HttpProxy.groupdict()['categoryname'])
here is the full regex:
(?P<datetime>\w\w\w\s+\d+\s+\d\d:\d\d:\d\d)\s+(?P<IP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*httpproxy\[(?P<HTTPcode>(.*))\]:\s+id\=\"(?P<id>([^\"]*))\"\s+severity\=\"(?P<Severity>([^\"]*))\"\s+sys\=\"(?P<sys>([^\"]*))\"\s+sub\=\"(?P<sub>([^\"]*))\"\s+name\=\"(?P<name>([^\"]*))\"\s+action\=\"(?P<action>([^\"]*))\"\s+method\=\"(?P<method>([^\"]*))\"\s+srcip\=\"(?P<srcip>([^\"]*))\"\s+dstip\=\"(?P<dstip>([^\"]*))\"\s+user\=\"(?P<user>[^\"]*)\"\s+statuscode\=\"(?P<statuscode>([^\"]*))\"\s+cached\=\"(?P<cached>([^\"]*))\"\s+profile\=\"(?P<profile>([^\"]*))\"\s+filteraction\=\"(?P<filteraction>([^\"]*))\"\s+size\=\"(?P<size>([^\"]*))\"\s+request\=\"(?P<request>([^\"]*))\"\s+url\=\"(?P<url>(.*))\"(\s+exceptions\=\"(?P<exceptions>([^\"]*))\"\s+error\=\"(?P<error>([^\"]*))\"\s+(reputation\=\"(?P<reputation_opt>([^\"]*))\"\s+)?category\=\"(?P<category>([^\"]*))\"\s+reputation\=\"(?P<reputation>([^\"]*))\"\s+categoryname\=\"(?P<categoryname>([^\"]*))\"\s+(content-type\=\"(?P<content_type>([^\"]*))\")?)?
Here is a sample input:
Oct 7 13:22:55 192.168.10.2 2013: 10:07-13:22:54 httpproxy[15359]: id="0001" severity="info" sys="SecureWeb" sub="http" name="http access" action="pass" method="GET" srcip="192.168.8.47" dstip="64.94.90.108" user="" statuscode="200" cached="0" profile="REF_DefaultHTTPProfile (Default Proxy)" filteraction="REF_DefaultHTTPCFFAction (Default content filter action)" size="1502" request="0x10870200" url="http://www.concordmonitor.com/csp/mediapool/sites/dt.common.streams.StreamServer.cls?STREAMOID=6rXcvJGqsPivgZ7qnO$Sic$daE2N3K4ZzOUsqbU5sYvZF78hLWDhaM8n_FuBV1yRWCsjLu883Ygn4B49Lvm9bPe2QeMKQdVeZmXF$9l$4uCZ8QDXhaHEp3rvzXRJFdy0KqPHLoMevcTLo3h8xh70Y6N_U_CryOsw6FTOdKL_jpQ-&CONTENTTYPE=image/jpeg" exceptions="" error="" category="134" reputation="neutral" categoryname="General News" content-type="image/jpeg"
I am trying to capture the entire log.
However, sometimes the url field has many quotation marks in it, which makes things confusing. Also, in some logs there is an extra reputation data field between error and reputation, and content-type does not always appear. Sometimes everything after the url data field is missing as well. This is why I added all the optional ?s. I am trying to account for these occurrences and print None when necessary.
Let's break your regex into two pieces:
....url\=\"(?P<url>(.*))\"
and
(\s+exceptions\=\"(?P<exceptions>([^\"]*))\"\s+error\=\"(?P<error>([^\"]*))\"\s+(reputation\=\"(?P<reputation_opt>([^\"]*))\"\s+)?category\=\"(?P<category>([^\"]*))\"\s+reputation\=\"(?P<reputation>([^\"]*))\"\s+categoryname\=\"(?P<categoryname>([^\"]*))\"\s+(content-type\=\"(?P<content_type>([^\"]*))\")?)?
The .* in the first part is greedy. It'll match everything it can, only backtracking if absolutely necessary.
The second part is one giant optional group.
When the regex executes, the .* will match everything up to the end of the string, then backtrack as necessary until the \" can match a quotation mark. That will be the last quotation mark in the string, and it will probably not be the one you wanted it to be.
Then, the giant optional group will try to match, but since the greedy .* already ate everything the giant optional group was supposed to parse, it'll fail. Since it's optional, the regex algorithm will be fine with that.
To fix this? Well, non-greedy quantifiers might help with the immediate problem, but a better solution is probably to stop trying to use a regex to parse this. Look for existing parsers for your data format. Are you trying to pull data out of HTML or XML? I've seen a lot of recommendations for BeautifulSoup.
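To make the backtracking point concrete, here is a cut-down sketch; the pattern and log line below are simplified stand-ins for the ones in the question:
import re

line = 'url="http://example.com/x" exceptions="" error="" category="134"'

greedy = re.search(r'url\="(?P<url>.*)"(\s+exceptions\="(?P<exceptions>[^"]*)")?', line)
print(greedy.group('url'))
# http://example.com/x" exceptions="" error="" category="134   <- the greedy .* ate too much
print(repr(greedy.group('exceptions')))
# None -- the optional group never got a chance to match

bounded = re.search(r'url\="(?P<url>[^"]*)"(\s+exceptions\="(?P<exceptions>[^"]*)")?', line)
print(bounded.group('url'))
# http://example.com/x
print(repr(bounded.group('exceptions')))
# '' -- the group participated and matched an empty value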

Extracting URLs in Python from XML

I read this thread about extracting URLs from a string: https://stackoverflow.com/a/840014/326905
Really nice, I got all URLs from an XML document containing http://www.blabla.com with
>>> s = '<link href="http://www.blabla.com/blah" /> <link href="http://www.blabla.com" />'
>>> re.findall(r'(https?://\S+)', s)
['http://www.blabla.com/blah"', 'http://www.blabla.com"']
But I can't figure out how to customize the regex to omit the double quote at the end of the URL.
First I thought that this is the clue:
re.findall(r'(https?://\S+\")', s)
or this:
re.findall(r'(https?://\S+\Z")', s)
but it isn't.
Can somebody help me out and tell me how to omit the double quote at the end?
Btw, the question mark after the "s" of https means the "s" can occur or not occur. Am I right?
>>> from lxml import html
>>> ht = html.fromstring(s)
>>> ht.xpath('//link/@href')
['http://www.blabla.com/blah', 'http://www.blabla.com']
You're already using a character class (albeit a shorthand version). I might suggest modifying the character class a bit, that way you don't need a lookahead. Simply add the quote as part of the character class:
re.findall(r'(https?://[^\s"]+)', s)
This still says "one or more characters not a whitespace," but has the addition of not including double quotes either. So the overall expression is "one or more character not a whitespace and not a double quote."
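For example, against the sample string:
import re

s = '<link href="http://www.blabla.com/blah" /> <link href="http://www.blabla.com" />'
print(re.findall(r'(https?://[^\s"]+)', s))
# ['http://www.blabla.com/blah', 'http://www.blabla.com']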
You want the double quotes to appear as a look-ahead:
re.findall(r'(https?://\S+)(?=\")', s)
This way they won't appear as part of the match. Also, yes the ? means the character is optional.
See example here: http://regexr.com?347nk
I used to extract URLs from text through this piece of code:
url_rgx = re.compile(ur'(?i)\b((?:https?://|www\d{0,3}[.]|[a-z0-9.\-]+[.][a-z]{2,4}/)(?:[^\s()<>]+|\(([^\s()<>]+|(\([^\s()<>]+\)))*\))+(?:\(([^\s()<>]+|(\([^\s()<>]+\)))*\)|[^\s`!()\[\]{};:\'".,<>?\xab\xbb\u201c\u201d\u2018\u2019]))')
# convert string to lower case
text = text.lower()
matches = re.findall(url_rgx, text)
# patch the 'http://' part if it is missed
urls = ['http://%s'%url[0] if not url[0].startswith('http') else url[0] for url in matches]
print urls
It works great!
Thanks. I just read https://stackoverflow.com/a/13057368/326905
and checked out this, which is also working:
re.findall(r'"(https?://\S+)"', urls)
