Passing REGEX string into re.search - python

I am stuck when trying to substitute a variable into a re.search.
I use the following code to gather a stored regex from a file and save it to the variable "regex." In this example the stored regex is used to find ip addresses with port numbers from a log message.
for line in workingconf:
    regexsearch = re.search(r'regex>>>(.+)', line)
    if regexsearch:
        regex = regexsearch.group(1)
        print regex
#I use re.search to go through "data" to find a match.
data = '[LOADBALANCER] /Common/10.10.10.10:10'
alertforsrch = re.search(r'%s' % regex, data)
if alertforsrch:
    print "MATCH"
    print alertforsrch.group(1)
else:
    print "no match"
When this program runs I get the following.
$ ./messageformater.py
/Common/([\d]{1,}\.[\d]{1,}\.[\d]{1,}\.[\d]{1,}:[\d]{1,})
no match
When I change the re.search call to use the literal pattern, as below, it works. The regex is read from the file and may not be the same every time, which is why I am trying to use a variable.
for line in workingconf:
    regexsearch = re.search(r'regex>>>(.+)', line)
    if regexsearch:
        regex = regexsearch.group(1)
        print regex
alertforsrch = re.search(r'/Common/([\d]{1,}\.[\d]{1,}\.[\d]{1,}\.[\d]{1,}:[\d]{1,})', data)
if alertforsrch:
    print "MATCH"
    print alertforsrch.group(1)
else:
    print "no match"
####### Results ########
$./messageformater.py
/Common/([\d]{1,}\.[\d]{1,}\.[\d]{1,}\.[\d]{1,}:[\d]{1,})
MATCH
10.10.10.10:10

Works fine for me...
Why even bother with the string formatter though? re.search(regex, data) should work fine.
You may have trailing whitespace at the end of the regex read in from the file. The (.+) capture stops before \n but will include a carriage return (\r) or trailing spaces, so try re.search(regex.strip(), data).
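One way to check the whitespace theory is to reproduce it with a made-up config line. The sketch below assumes the file uses Windows line endings. The sample line and the names stored, raw_match and clean_match are illustrative, not taken from the original script.

```python
import re

# Hypothetical config line as it might be read from a file with Windows line endings.
line = 'regex>>>/Common/(\\d+\\.\\d+\\.\\d+\\.\\d+:\\d+)\r\n'
data = '[LOADBALANCER] /Common/10.10.10.10:10'

# '.' matches '\r' but not '\n', so the captured pattern keeps a trailing '\r'.
stored = re.search(r'regex>>>(.+)', line).group(1)

raw_match = re.search(stored, data)            # pattern demands a '\r' after the port
clean_match = re.search(stored.strip(), data)  # trailing whitespace removed

print(repr(stored))
print(raw_match)
print(clean_match.group(1))
```

Printing repr(regex) in the original script, instead of plain regex, would make any such invisible character show up in the output.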

Related

Python Regex with Paramiko

I am using the Python Paramiko module to SFTP into one of my servers. I did a list_dir() to get all of the files in the folder. From that list I'd like to use a regex to find the names matching a pattern and then print out the entire string.
list_dir() returns a list of the XML files, named in this format:
LOG_MMDDYYYY_HHMM.XML
LOG_07202018_2018 --> this is for the date 07/20/2018 at the time 20:18
I'd like to use a regex to find all the XML files for that particular date and store them in a list or a variable. I can then pass this variable to Paramiko to get the files.
for log in file_list:
    regex_pattern = 'POSLog_' + date + '*'
    if (re.search(regex_pattern, log) != None):
        matchObject = re.findall(regex_pattern, log)
        print(matchObject)
The code above just prints:
['Log_07202018']
I want it to store the entire string Log_07202018_20:18.XML in a variable.
How would I go about doing this?
Thank you
If you are looking for a fixed string, don't use regex.
search_str = 'POSLog_' + date
for line in file_list:
    if search_str in line:
        print(line)
Alternatively, a list comprehension can build the list of matching lines in one go:
log_lines = [line for line in file_list if search_str in line]
for line in log_lines:
    print(line)
If you must use regex, there are a few things to change:
Any variable part that you put into the regex pattern must either be guaranteed to be a regex itself, or it must be escaped.
"The rest of the line" is not *, it's .*.
The start-of-line anchor ^ should be used to speed up the search - this way the regex fails faster when there is no match on a given line.
To support the ^ on multiple lines instead of only at the start of the entire string, the MULTILINE flag is needed.
There are several ways of getting all matches. One could do "for each line, if there is a match, print line", same as above. Here I'm using .finditer() and a search over the whole input block (i.e. not split into lines).
import re
log_pattern = '^POSLog_' + re.escape(date) + '.*'
for match in re.finditer(log_pattern, whole_file, re.MULTILINE):
    # group(0) is the matched line; match.string would be the whole input
    print(match.group(0))
Because you only print the matched part, just do print(log) instead and it'll print the whole filename.
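The regex variants discussed above can be sketched as follows. The date value and the file names are made up for illustration, and the POSLog_ prefix is taken from the question's code rather than from the LOG_ format it describes. This is only one possible arrangement, not the one definitive way to do it.

```python
import re

# Hypothetical inputs: a date string and file names as a directory listing might return them.
date = '07202018'
file_list = [
    'POSLog_07202018_2018.XML',
    'POSLog_07212018_0930.XML',
    'POSLog_07202018_0815.XML',
]

log_pattern = re.compile('^POSLog_' + re.escape(date) + '.*')

# Per-name check: keep the whole file name whenever the pattern matches it.
matching = [name for name in file_list if log_pattern.search(name)]

# Whole-block search: MULTILINE lets ^ match at the start of every line.
whole_file = '\n'.join(file_list)
from_block = [m.group(0)
              for m in re.finditer(log_pattern.pattern, whole_file, re.MULTILINE)]

print(matching)
```

Both lists hold the complete file names for the chosen date, so either can be passed on to the download step.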

Finding multiple lines with a regex?

I am trying to complete the "Regex search" project from the book Automate the Boring Stuff with Python. I searched for an answer but failed to find a related Python thread.
The task is: "Write a program that opens all .txt files in a folder and searches for any line that matches a user-supplied regular expression. The results should be printed to the screen."
With the compiled pattern below I manage to find the first match
regex = re.compile(r".*(%s).*" % search_str)
And I can print it out with
print(regex.search(content).group())
But if I try to use
print(regex.findall(content))
The output is only the inputted word/words, not the whole line they are on. Why won't findall match the whole line, even though that is how I compiled the regex?
My code is as follows.
# Regex search - Find user given text from a .txt file
# and prints the line it is on
import re
# user input
print("\nThis program searches for lines with your string in them\n")
search_str = input("Please write the string you are searching for: \n")
print("")
# file input
file = open("/users/viliheikkila/documents/kooditreeni/input_file.txt")
content = file.read()
file.close()
# create regex
regex = re.compile(r".*(%s).*" % search_str)
# print out the lines with match
if regex.search(content) is None:
    print("No matches were found.")
else:
    print(regex.findall(content))
In Python regex, parentheses define a capturing group.
When the pattern contains a capturing group, findall returns only the captured group, not the whole match. If you want the entire line, either iterate over the result of finditer and take each full match, or drop the parentheses from the pattern.
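The difference is easy to see on a small sample. The content and search_str values below are made up for illustration, and the user input is used directly as a regex, as the exercise asks.

```python
import re

# Hypothetical file content and user-supplied pattern.
content = "first apple line\nno fruit here\nsecond apple line\n"
search_str = "apple"

regex = re.compile(r".*(%s).*" % search_str)

# With one capturing group, findall() returns only what the group captured.
groups_only = regex.findall(content)

# finditer() yields match objects; group(0) is the full match, i.e. the whole line.
whole_lines = [m.group(0) for m in regex.finditer(content)]

# A non-capturing group (?:...) makes findall() return full matches as well.
no_capture = re.findall(r".*(?:%s).*" % search_str, content)

print(groups_only)
print(whole_lines)
```

Because . does not match a newline by default, each .* stays within one line, which is why the full match is exactly the line containing the search text.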

Matching with regex line for line

I am working on a fun little language using regex matching lines in a file. Here is what I have so far:
import re
code=open("code.txt", "r").read()
outputf=r'output (.*)'
inputf=r'(.*) = input (.*)'
intf=r'int (.*) = (\d)'
floatf=r'float (.*) = (\d\.\d)'
outputq=re.match(outputf, code)
if outputq:
    print "Executing OUTPUT query"
    exec "print %s" %outputq.group(1)
inputq=re.match(inputf, code)
if inputq:
    print "Executing INPUT query"
    exec "%s=raw_input(%s)"%(inputq.group(1), inputq.group(2))
intq=re.match(intf, code)
if intq:
    exec "%s = %s"%(intq.group(1), intq.group(2))
    exec "print %s"%(intq.group(1))
else:
    print "Invalid syntax"
The code works for matching, say:
int x = 1
But it will only match the first line and stop matching and ignore the rest of the code that I want to match. How can I match every line in the file to my regex definitions?
.read() returns the whole file as a single string. Use .split("\n") on the result of .read(), or use .readlines().
Then iterate over the lines and test each one for your commands.
At the moment you treat the whole code as one single line. You want to check it line by line.
EDIT:
To do that, create a function, read the lines with readlines(), and finally iterate over the lines, calling the function on each one.
Like this:
import re
outputf=r'output (.*)'
inputf=r'(.*) = input (.*)'
intf=r'int (.*) = (\d)'
floatf=r'float (.*) = (\d\.\d)'
def check_line(line):
    outputq=re.match(outputf, line)
    if outputq:
        print ("Executing OUTPUT query")
        exec ("print (%s)" % outputq.group(1))
    inputq=re.match(inputf, line)
    if inputq:
        print ("Executing INPUT query")
        exec ("%s=raw_input(%s)"%(inputq.group(1), inputq.group(2)))
    intq=re.match(intf, line)
    if intq:
        exec ("%s = %s"%(intq.group(1), intq.group(2)))
        exec ("print (%s)"%(intq.group(1)))
    else:
        print ("Invalid syntax")
code=open("code.txt", "r").readlines()
for line in code:
    check_line(line)
This code will still raise an error, but that has nothing to do with the original issue. Check whether you are assigning the variable correctly.
You're using re.match(), which only matches at the beginning of the string (which in this case is the whole file), so only the first line is ever tested. If you iterate over each line in the file, then .match() will work. Alternatively you might want to look at re.search(), re.findall() and other similar alternatives.
It looks like your code needs to iterate over the lines in the file: How to iterate over the file in python
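A line-by-line interpreter along these lines can be sketched without exec, which is a swapped-in design choice rather than what the question's code does. Variables are kept in a dictionary here. The run function, the variables dict, the printed list and the tightened patterns (\w+ and \d+) are all illustrative assumptions.

```python
import re

# Assumed, slightly stricter patterns than the ones in the question.
OUTPUT_RE = re.compile(r'output (.*)')
INT_RE = re.compile(r'int (\w+) = (\d+)')

def run(lines):
    """Interpret each line on its own and return (variables, printed values)."""
    variables = {}
    printed = []
    for line in lines:
        line = line.strip()
        m = INT_RE.match(line)      # match() anchors at the start of this line only
        if m:
            variables[m.group(1)] = int(m.group(2))
            continue
        m = OUTPUT_RE.match(line)
        if m:
            name = m.group(1)
            # Print the variable's value if it is defined, else the raw text.
            printed.append(variables.get(name, name))
            continue
        printed.append('Invalid syntax: ' + line)
    return variables, printed

variables, printed = run(['int x = 1', 'output x', 'bogus'])
print(printed)
```

Reading the program with open("code.txt").readlines() and passing the result to run() would apply the same checks to every line of the file.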

Pyparsing finds first occurence in file

I'm parsing a file via
output=wildcard.parseFile(myfile)
print output
And I only get the first match.
I have a big config file to parse, with "entries" which are surrounded by braces.
I expect to see all the matches that are in the file, or an exception if something does not match.
How do I achieve that?
By default, pyparsing matches as much as the grammar allows, starting at the first character, and ignores whatever input is left over. So, if your parser is given by num = Word('0123456789'), parsing "462" and parsing "462-780" both return the same value. However, if the parseAll=True option is passed, the parser must consume the entire string. In this case, "462" would be matched, but parsing "462-780" would raise a ParseException, because the parser doesn't know how to deal with the dash.
I would recommend constructing something that will match the entirety of the file, then using the parseAll=True flag in parseFile(). If I understand your description of each entry being separated by braces correctly, one could do the following.
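from pyparsing import OneOrMore
entire_file = OneOrMore('[' + wildcard + ']')
# call parseFile on the combined parser, not on wildcard itself
output = entire_file.parseFile(myfile, parseAll=True)
print output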

Python IP Parsing

I am working with a SIEM and need to be able to parse IP addresses from relatively large files. They don't have consistent fields, so "cut" is not an option. I am using a modified Python script to remove all characters except a-z, A-Z, 0-9 and period "." so that the file can be properly parsed. The issue is that this does not work with my SIEM files. If I have a text file that looks like this "192.168.1.2!##$!#%#$" it is fine: it properly drops all of the characters I do not need and outputs just the IP to a new file. But if the file looks like this "192.168.168.168##$% this is a test", nothing is extracted after the first stage of removing abnormal characters. Please help, I have no idea why it does this. Here is my code:
#!/usr/bin/python
import re
import sys
unmodded = raw_input("Please enter the file to parse. Example: /home/aaron/ipcheck: ")
string = open(unmodded).read()
new_str = re.sub('[^a-zA-Z0-9.\n\.]', ' ', string)
open('modifiedipcheck.txt', 'w').write(new_str)
try:
    file = open('modifiedipcheck.txt', "r")
    ips = []
    for text in file.readlines():
        text = text.rstrip()
        regex = re.findall(r'(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?: [\d]{1,3})$',text)
        if regex is not None and regex not in ips:
            ips.append(regex)
    for ip in ips:
        outfile = open("checkips", "a")
        combine = "".join(ip)
        if combine is not '':
            print "IP: %s" % (combine)
            outfile.write(combine)
            outfile.write("\n")
finally:
    file.close()
    outfile.close()
Anyone have any ideas? Thanks a lot in advance.
Your regex ends with $, which indicates that it expects the line to end at that point. If you remove that, it should work fine:
regex = re.findall(r'(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})', text)
You can also simplify the regex itself further:
regex = re.findall(r'(?:\d{1,3}\.){3}\d{1,3}', text)
Here is what I think is happening. You have a pattern that looks for garbage characters and replaces them with a space. When you have an IP address followed by nothing but garbage, the garbage is turned into spaces, and then when you strip the string the spaces are gone, leaving nothing but the address you want to match.
Your pattern ends in a $, so it is anchored to the end of the line. When the address is the last thing on the line, it matches.
When the line also contains "this is a test", those non-garbage characters are left alone and strip doesn't remove them, so the $ means that the IP address no longer matches.
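The effect of the trailing $ can be reproduced with the two sample lines from the question. This is a minimal sketch: the clean helper and the variable names are illustrative, and the pattern is the simplified one suggested above.

```python
import re

ip_pattern = re.compile(r'(?:\d{1,3}\.){3}\d{1,3}')
anchored = re.compile(ip_pattern.pattern + r'$')   # same pattern, tied to end of line

def clean(line):
    """Replace unwanted characters with spaces, then strip trailing whitespace."""
    return re.sub(r'[^a-zA-Z0-9.\n]', ' ', line).rstrip()

garbage_only = clean('192.168.1.2!##$!#%#$')
with_text = clean('192.168.168.168##$% this is a test')

# The anchored pattern only matches when the address ends the line.
anchored_hits = [anchored.findall(garbage_only), anchored.findall(with_text)]

# Without the anchor, the address is found wherever it sits in the line.
free_hits = [ip_pattern.findall(garbage_only), ip_pattern.findall(with_text)]

print(anchored_hits)
print(free_hits)
```

Note that \d{1,3} also accepts values above 255, so this finds strings shaped like IPv4 addresses rather than validating them.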
