I am working with a SIEM and need to be able to parse IP addresses from relatively large files. They don't have consistent fields, so "cut" is not an option. I am using a modified Python script to replace all characters except a-z, A-Z, 0-9, and the period "." so that the file can be properly parsed. The issue is that this does not work with my SIEM files. If I have a text file that looks like "192.168.1.2!##$!#%#$", it is fine: the script drops all of the characters I do not need and outputs just the IP to a new file. But if the file looks like "192.168.168.168##$% this is a test", the script does nothing further with the line after the first stage of removing abnormal characters. Please help, I have no idea why it does this. Here is my code:
#!/usr/bin/python
import re
import sys
unmodded = raw_input("Please enter the file to parse. Example: /home/aaron/ipcheck: ")
string = open(unmodded).read()
new_str = re.sub('[^a-zA-Z0-9.\n\.]', ' ', string)
open('modifiedipcheck.txt', 'w').write(new_str)
try:
    file = open('modifiedipcheck.txt', "r")
    ips = []
    for text in file.readlines():
        text = text.rstrip()
        regex = re.findall(r'(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})$',text)
        if regex is not None and regex not in ips:
            ips.append(regex)
    for ip in ips:
        outfile = open("checkips", "a")
        combine = "".join(ip)
        if combine is not '':
            print "IP: %s" % (combine)
            outfile.write(combine)
            outfile.write("\n")
finally:
    file.close()
    outfile.close()
Anyone have any ideas? Thanks a lot in advance.
Your regex ends with $, which indicates that it expects the line to end at that point. If you remove that, it should work fine:
regex = re.findall(r'(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})\.(?:[\d]{1,3})', text)
You can also simplify the regex itself further:
regex = re.findall(r'(?:\d{1,3}\.){3}\d{1,3}', text)
Here is what I think is happening. Your first pattern looks for garbage characters and replaces each one with a space. When an IP address is followed by nothing but garbage, the garbage turns into spaces, and rstrip() then removes those spaces, leaving nothing but the address you want to match.
Your pattern ends in a $, so it is anchored to the end of the line; when the address is the last thing on the line, it matches.
When the line also contains "this is a test", those non-garbage characters are left alone and rstrip() doesn't remove them, so the $ prevents the IP address from matching.
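The effect of the trailing $ can be reproduced in a few lines. This is a minimal sketch in Python 3, not the asker's script: the helper name clean() and the two sample lines are made up for illustration, and the pattern is the simplified one suggested above.

```python
import re

# Same pattern with and without the end-of-line anchor.
ANCHORED = re.compile(r'(?:\d{1,3}\.){3}\d{1,3}$')
UNANCHORED = re.compile(r'(?:\d{1,3}\.){3}\d{1,3}')

def clean(line):
    """Hypothetical helper: replace everything except letters, digits,
    and periods with spaces, then strip trailing whitespace."""
    return re.sub(r'[^a-zA-Z0-9.]', ' ', line).rstrip()

garbage_only = clean("192.168.1.2!##$!#%#$")
with_words = clean("192.168.168.168##$% this is a test")

# Trailing garbage becomes spaces and is stripped, so the IP ends the line.
anchored_garbage = ANCHORED.findall(garbage_only)
# The words survive cleaning, so the IP no longer sits at the end of the line.
anchored_words = ANCHORED.findall(with_words)
# Without the anchor the IP is found wherever it appears.
unanchored_words = UNANCHORED.findall(with_words)

print(anchored_garbage, anchored_words, unanchored_words)
```

Only the anchored pattern fails on the second line, which matches the behaviour described in the question.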
I am trying to figure out how to write a simple regex that would highlight newline characters only if they appear at the beginning or end of some data while preserving the newline.
In the below example, line 1 and line 14 both are new lines. Those are the only two lines I am trying to highlight as they appear at the beginning and end of the data.
import regex as re
from colorama import Fore, Back
def red(s):
    return Back.RED + s + Back.RESET
with open('/tmp/1.py', 'r') as f:
    data = f.read()
print(
    re.sub(r'(^\n|\n$)', red(r'\1'), data)
)
In the open expression, data is the same content as the example posted above.
In the above example, this is the result I am getting:
As one can see, the red highlight is missing on line 1 and is spanning all the way in line 14. What I would like is for the color to appear only once per new line character.
You can actually use your regex, but without the "multiline" flag. Then it will treat the whole string as one line and match only what you want.
^\n|\n$
Here you can see that there are two matches. If you delete the newlines at the start or at the end, the matches disappear. On regex101, the multiline flag is set or cleared at the end of the regex line. You could do that in your language too.
https://regex101.com/r/pSRHPU/2
After reading all the comments, and suggestions, and combining a subset of them all, I finally have a working version. For anyone that is interested:
One issue I cannot overcome without writing an OS-specific check is that an extra newline is added on Windows.
A couple of highlights that were picked up:
You cannot color a \n, so replace it with a space and a newline.
I have not tested this, but by getting rid of the group replacement, it may be possible to apply this to bytes as well.
Windows support can be enabled with init from colorama.
import regex as re
from colorama import Back, init
init()  # for windows
def red(s):
    return Back.RED + s + Back.RESET
with open('/tmp/1.py', 'r') as f:
    data = f.read()
# raw strings keep \A and \Z from being treated as string escapes
first_line = re.sub(r'\A\n', red(' ') + '\n', data)
last_line = re.sub(r'\n\Z', '\n' + red(' '), first_line)
print(last_line)
OSX/Linux
Windows
I found a way that seems to allow you to match the start/end of the whole string. See the "Permanent Start of String and End of String Anchors" part from https://www.regular-expressions.info/anchors.html
\A only ever matches at the start of the string. Likewise, \Z only ever matches at the end of the string.
I created a demo here https://regex101.com/r/n2DAWh/1
Regex is: (\A\n|\n\Z)
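To see the difference between the line anchors and the permanent string anchors without a terminal or colorama, here is a small sketch using the standard re module (the question imports the third-party regex package, which is assumed to behave the same way for these anchors). The sample string is made up.

```python
import re

# Made-up sample: a leading newline, a blank line in the middle, a trailing newline.
data = "\nfirst\n\nlast\n"

# With MULTILINE, ^ and $ match at every line boundary,
# so the newlines around the blank line in the middle are hit as well.
multiline_hits = [m.start() for m in re.finditer(r'^\n|\n$', data, re.MULTILINE)]

# \A and \Z only ever match at the very start and very end of the string.
anchored_hits = [m.start() for m in re.finditer(r'\A\n|\n\Z', data)]

print(multiline_hits)
print(anchored_hits)
```

The second list contains only the offsets of the first and last character, which is the behaviour the highlighting needs.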
I am using Python Paramiko module to sftp into one of my servers. I did a list_dir() to get all of the files in the folder. Out of the folder I'd like to use regex to find the matching pattern and then printout the entire string.
List_dir will list a list of the XML files with this format
LOG_MMDDYYYY_HHMM.XML
LOG_07202018_2018 --> this is for the date 07/20/2018 at the time 20:18
I'd like to use regex to find all the XML files for that particular date and store them in a list or a variable. I can then pass this variable to Paramiko to get the files.
for log in file_list:
    regex_pattern = 'POSLog_' + date + '*'
    if (re.search(regex_pattern, log) != None):
        matchObject = re.findall(regex_pattern, log)
        print(matchObject)
The code above just prints:
['Log_07202018']
I want it to store the entire string Log_07202018_20:18.XML in a variable.
How would I go about doing this?
Thank you
If you are looking for a fixed string, don't use regex.
search_str = 'POSLog_' + date
for line in file_list:
    if search_str in line:
        print(line)
Alternatively, a list comprehension can make a list of matching lines in one go:
log_lines = [line for line in file_list if search_str in line]
for line in log_lines:
    print(line)
If you must use regex, there are a few things to change:
Any variable part that you put into the regex pattern must either be guaranteed to be a regex itself, or it must be escaped.
"The rest of the line" is not *, it's .*.
The start-of-line anchor ^ should be used to speed up the search - this way the regex fails faster when there is no match on a given line.
To support the ^ on multiple lines instead of only at the start of the entire string, the MULTILINE flag is needed.
There are several ways of getting all matches. One could do "for each line, if there is a match, print line", same as above. Here I'm using .finditer() and a search over the whole input block (i.e. not split into lines).
log_pattern = '^POSLog_' + re.escape(date) + '.*'
for match in re.finditer(log_pattern, whole_file, re.MULTILINE):
    # match.group() is the matched line; match.string would be the whole input
    print(match.group())
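For a list of names rather than one block of text, the same points can be sketched as follows. The file names and the date value are invented stand-ins for what list_dir() might return, and the `_HHMM.XML` tail of the pattern is an assumption based on the naming scheme described in the question.

```python
import re

# Hypothetical directory listing; real names would come from the SFTP server.
file_list = [
    "POSLog_07202018_2018.XML",
    "POSLog_07202018_0915.XML",
    "POSLog_07212018_1100.XML",
    "notes.txt",
]
date = "07202018"

# re.escape() protects against regex metacharacters in the variable part.
# fullmatch() requires the whole name to fit the pattern, so no ^ or $ is needed.
pattern = re.compile("POSLog_" + re.escape(date) + r"_\d{4}\.XML")

# Keep the full file name, not just the matched prefix.
matching = [name for name in file_list if pattern.fullmatch(name)]
print(matching)
```

The resulting list holds complete file names, which can then be passed one by one to the SFTP client.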
Because you only print the matched part, just do print(log) instead and it'll print the whole filename.
I'm looking to find out how to use Python to get rid of needless newlines in text like what you get from Project Gutenberg, where their plain-text files are formatted with newlines every 70 characters or so. In Tcl, I could do a simple string map, like this:
set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]
This would keep paragraphs separated by two newlines (or a newline and a tab) separate, but run together the lines that ended with a single newline (substituting a space), and drop superfluous CR's. Since Python doesn't have string map, I haven't yet been able to find out the most efficient way to dump all the needless newlines, although I'm pretty sure it's not just to search for each newline in order and replace it with a space. I could just evaluate the Tcl expression in Python, if all else fails, but I'd like to find out the best Pythonic way to do the same thing. Can some Python connoisseur here help me out?
The nearest equivalent to the tcl string map would be str.translate, but unfortunately it can only map single characters. So it would be necessary to use a regexp to get a similarly compact example. This can be done with look-behind/look-ahead assertions, but the \r's have to be replaced first:
import re
oldtext = """\
This would keep paragraphs separated.
This would keep paragraphs separated.

This would keep paragraphs separated.
\tThis would keep paragraphs separated.

\rWhen, in the course
of human events,
it becomes necessary
\rfor one people
"""
newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))
output:
This would keep paragraphs separated. This would keep paragraphs separated.

This would keep paragraphs separated.
    This would keep paragraphs separated.

When, in the course of human events, it becomes necessary for one people
I doubt whether this is as efficient as the tcl code, though.
UPDATE:
I did a little test using this Project Gutenberg EBook of War and Peace (Plain Text UTF-8, 3.1 MB). Here's my tcl script:
set fp [open "gutenberg.txt" r]
set oldtext [read $fp]
close $fp
set newtext [string map "{\r} {} {\n\n} {\n\n} {\n\t} {\n\t} {\n} { }" $oldtext]
puts $newtext
and my python equivalent:
import re
with open('gutenberg.txt') as stream:
    oldtext = stream.read()
newtext = re.sub(r'(?<!\n)\n(?![\n\t])', ' ', oldtext.replace('\r', ''))
print(newtext)
Crude performance test:
$ /usr/bin/time -f '%E' tclsh gutenberg.tcl > output1.txt
0:00.18
$ /usr/bin/time -f '%E' python gutenberg.py > output2.txt
0:00.30
So, as expected, the tcl version is more efficient. However, the output from the python version seems somewhat cleaner (no extra spaces inserted at the beginning of lines).
You can use a regular expression with a look-ahead search:
import re
text = """
...
"""
newtext = re.sub(r"\n(?=[^\n\t])", " ", text)
That will replace any new line that is not followed by a newline or a tab with a space.
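One detail worth checking with this look-ahead-only pattern is what happens at a paragraph break: the second newline of a blank-line separator is followed by ordinary text, so it is replaced too. The sketch below compares it with a variant that adds a look-behind; the sample text is made up.

```python
import re

# Made-up sample: two wrapped lines, a blank-line paragraph break,
# then a tab-indented paragraph.
text = "line one\nline two\n\nnew paragraph\n\tindented paragraph\n"

# Look-ahead only: the newline right before "new paragraph" becomes a space.
lookahead_only = re.sub(r"\n(?=[^\n\t])", " ", text)

# With a look-behind as well, a newline that follows another newline is kept.
with_lookbehind = re.sub(r"(?<!\n)\n(?=[^\n\t])", " ", text)

print(repr(lookahead_only))
print(repr(with_lookbehind))
```

Whether the leading space after a paragraph break matters depends on how the text is used afterwards.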
I use the following script when I want to do this:
import sys
import os
filename, extension = os.path.splitext(sys.argv[1])
with open(filename+extension, encoding='utf-8-sig') as (file
), open(filename+"_unwrapped"+extension, 'w', encoding='utf-8-sig') as (output
):
    *lines, last = list(file)
    for line in lines:
        if line == "\n":
            line = "\n\n"
        elif line[0] == "\t":
            line = "\n" + line[:-1] + " "
        else:
            line = line[:-1] + " "
        output.write(line)
    output.write(last)
A "blank" line, with only a linefeed, turns into two linefeeds (to replace the one removed from the previous line). This handles files that separate paragraphs with two linefeeds.
A line beginning with a tab gets a leading linefeed (to replace the one removed from the previous line) and gets its trailing linefeed replaced with a space. This handles files that separate paragraphs with a tab character.
A line that is neither blank nor beginning with a tab gets its trailing linefeed replaced with a space.
The last line in the file may not have a trailing linefeed and therefore gets copied directly.
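The four rules above can also be tried on an in-memory list of lines, which makes them easier to experiment with. This is a sketch of the same logic, not a replacement for the script; the function name unwrap() is made up.

```python
def unwrap(lines):
    """Apply the four rules above to a list of lines (hypothetical helper).

    Assumes the list is non-empty and every line except possibly the last
    ends with a linefeed.
    """
    *body, last = lines
    out = []
    for line in body:
        if line == "\n":
            # Blank line: restore the linefeed removed from the previous line.
            out.append("\n\n")
        elif line.startswith("\t"):
            # Tab-indented line: starts a new paragraph.
            out.append("\n" + line[:-1] + " ")
        else:
            # Ordinary wrapped line: join it to the next one.
            out.append(line[:-1] + " ")
    # The last line is copied as-is.
    out.append(last)
    return "".join(out)

blank_separated = unwrap(["a\n", "b\n", "\n", "c\n", "d"])
tab_separated = unwrap(["a\n", "\tb\n", "c"])
print(repr(blank_separated))
print(repr(tab_separated))
```

Note that a space is left at the end of each paragraph, just before the restored linefeed.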
I am stuck when trying to substitute a variable into a re.search.
I use the following code to gather a stored regex from a file and save it to the variable "regex." In this example the stored regex is used to find ip addresses with port numbers from a log message.
for line in workingconf:
    regexsearch = re.search(r'regex>>>(.+)', line)
    if regexsearch:
        regex = regexsearch.group(1)
        print regex
#I use re.search to go through "data" to find a match.
data = '[LOADBALANCER] /Common/10.10.10.10:10'
alertforsrch = re.search(r'%s' % regex, data)
if alertforsrch:
    print "MATCH"
    print alertforsrch.group(1)
else:
    print "no match"
When this program runs I get the following.
$ ./messageformater.py
/Common/([\d]{1,}\.[\d]{1,}\.[\d]{1,}\.[\d]{1,}:[\d]{1,})
no match
When I change re.search to the following, it works. The regex will be obtained from the file and may not be the same every time. That is why I am trying to use a variable.
for line in workingconf:
    regexsearch = re.search(r'regex>>>(.+)', line)
    if regexsearch:
        regex = regexsearch.group(1)
        print regex
alertforsrch = re.search(r'/Common/([\d]{1,}\.[\d]{1,}\.[\d]{1,}\.[\d]{1,}:[\d]{1,})', data)
if alertforsrch:
    print "MATCH"
    print alertforsrch.group(1)
else:
    print "no match"
####### Results ########
$./messageformater.py
/Common/([\d]{1,}\.[\d]{1,}\.[\d]{1,}\.[\d]{1,}:[\d]{1,})
MATCH
10.10.10.10:10
Works fine for me...
Why even bother with the string formatter though? re.search(regex, data) should work fine.
You may have a newline character at the end of the regex read in from the file - try re.search(regex.strip(), data)
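One way invisible trailing characters can end up in the stored pattern is a Windows-style line ending: `.` does not match "\n", but it does match "\r". The sketch below assumes such a config line (the line content is invented to mirror the question) and shows that stripping the captured pattern makes the search succeed.

```python
import re

# Hypothetical config line ending in CRLF, as it might look when read from the file.
line = r"regex>>>/Common/(\d+\.\d+\.\d+\.\d+:\d+)" + "\r\n"
data = "[LOADBALANCER] /Common/10.10.10.10:10"

# (.+) stops before "\n" but captures the "\r".
stored = re.search(r"regex>>>(.+)", line).group(1)

# The stray "\r" becomes part of the pattern, so nothing in data matches.
raw_hit = re.search(stored, data)

# After strip() the pattern is exactly what was typed in the config file.
clean_hit = re.search(stored.strip(), data)

print(repr(stored))
print(raw_hit)
print(clean_hit.group(1))
```

Printing repr() of the stored pattern is a quick way to spot such characters, since a plain print hides them.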
I know I'm an idiot, but I can't pull the domain out of this email address:
'blahblah#gmail.com'
My desired output:
'#gmail.com'
My current output:
.
(it's just a period character)
Here's my code:
import re
test_string = 'blahblah#gmail.com'
domain = re.search('#*?\.', test_string)
print domain.group()
Here's what I think my regular expression says ('#*?.', test_string):
' # begin to define the pattern I'm looking for (also tell python this is a string)
# # find all patterns beginning with the at symbol ("#")
* # find all characters after ampersand
? # find the last character before the period
\ # breakout (don't use the next character as a wild card, use it as a string character)
. # find the "." character
' # end definition of the pattern I'm looking for (also tell python this is a string)
, test string # run the preceding search on the variable "test_string," i.e., 'blahblah#gmail.com'
I'm basing this off the definitions here:
http://docs.activestate.com/komodo/4.4/regex-intro.html
Also, I searched but other answers were a bit too difficult for me to get my head around.
Help is much appreciated, as usual. Thanks.
My stuff if it matters:
Windows 7 Pro (64 bit)
Python 2.6 (64 bit)
PS. StackOverflow question: My posts don't include new lines unless I hit "return" twice in between them. For example (these are all on a different line when I'm posting):
# - find all patterns beginning with the at symbol ("#")
* - find all characters after ampersand
? - find the last character before the period
\ - breakout (don't use the next character as a wild card, use it as a string character)
. - find the "." character
, test string - run the preceding search on the variable "test_string," i.e., 'blahblah#gmail.com'
That's why I got a blank line b/w every line above. What am I doing wrong? Thx.
Here's something I think might help
import re
s = 'My name is Conrad, and blahblah#gmail.com is my email.'
domain = re.search("#[\w.]+", s)
print domain.group()
outputs
#gmail.com
How the regex works:
# - scan till you see this character
[\w.] a set of characters to potentially match, so \w is all alphanumeric characters, and the trailing period . adds to that set of characters.
+ one or more of the previous set.
Because this regex is matching the period character and every alphanumeric after an #, it'll match email domains even in the middle of sentences.
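It may help to see side by side why the pattern from the question returns only a period. In `#*?\.`, the `*?` applies to the `#` itself and means "zero or more '#', as few as possible", so the first position where the whole pattern can succeed is the period. A short sketch in Python 3, using the test string from the question:

```python
import re

test_string = 'blahblah#gmail.com'

# '#*?' may match zero '#' characters, so the search succeeds
# at the first '.' it can reach and returns just that character.
lazy_match = re.search(r'#*?\.', test_string).group()

# '#' followed by one or more word characters or periods captures the domain.
domain = re.search(r'#[\w.]+', test_string).group()

print(lazy_match)
print(domain)
```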
Ok, so why not use split? (or partition )
"#"+'blahblah#gmail.com'.split("#")[-1]
Or you can use other string methods like find
>>> s="bal#gmail.com"
>>> s[ s.find("#") : ]
'#gmail.com'
>>>
and if you are going to extract out email addresses from some other text
f = open("file")
for line in f:
    # check each word separately; "#" in a list only finds an exact "#" item
    for word in line.split():
        if "#" in word:
            print "#" + word.split("#")[-1]
f.close()
Using regular expressions:
>>> re.search('#.*', test_string).group()
'#gmail.com'
A different way:
>>> '#' + test_string.split('#')[1]
'#gmail.com'
You can try using urllib (parse.splituser splits on the "@" character, and it is deprecated in recent Python versions):
from urllib import parse
email = 'myemail@mydomain.com'
domain = parse.splituser(email)[1]
Output will be
'mydomain.com'
Just wanted to point out that chrisaycock's method would match invalid email addresses of the form
herp#
to correctly ensure you're just matching a possibly valid email with domain you need to alter it slightly
Using regular expressions:
>>> re.search('#.+', test_string).group()
'#gmail.com'
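The difference between the two patterns only shows up on input with nothing after the separator. A minimal sketch in Python 3, using the invalid address mentioned above:

```python
import re

invalid = 'herp#'

# '.*' accepts zero characters after '#', so the bare separator still matches.
star_match = re.search('#.*', invalid)

# '.+' needs at least one character after '#', so there is no match.
plus_match = re.search('#.+', invalid)

print(star_match.group())
print(plus_match)
```

Because the second search returns None, calling .group() on it directly would raise an AttributeError, so the result should be checked first.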
Using the regular expression below, you can extract any domain ending in .com or .in.
import re
s = 'my first email is user1#gmail.com second email is user2#yahoo.in and third email is user3#outlook.com'
print(re.findall(r'#\S+\.(?:in|com)\b', s))
output
['#gmail.com', '#yahoo.in', '#outlook.com']
Here is another method using the index function:
email_addr = 'blahblah#gmail.com'
# Find the location of # sign
index = email_addr.index("#")
# extract the domain portion starting from the index
email_domain = email_addr[index:]
print(email_domain)
#------------------
# Output:
#gmail.com