I know the set points <start point> and <end point> in the text file, and I need to use them to find certain information between them, which will then be used and printed. I currently have .readlines() in a different function, which the new function uses to find the information.
You can try something like this:
flag = False
info = []  # your desired information will be appended as strings to this list

with open(your_file, 'r') as file:
    for line in file:
        if '<start point>' in line:  # reached the start point
            flag = True
            continue  # don't append the start-point line itself
        if '<end point>' in line:  # reached the end point
            flag = False
        if flag:  # this line is between the start point and the end point
            info.append(line.strip())

>>> info
['Number=12', 'Word=Hello']
This seems like a job for regular expressions. If you have not yet encountered regular expressions, they are an extremely powerful tool that can basically be used to search for a specific pattern in a text string.
For example, the regular expression (or regex for short) Number=\d+ would find any line in the text document that has Number= followed by one or more digits. The regex Word=\w+ would match any string starting with Word= followed by one or more word characters (letters, digits, and underscores).
In python you can use regular expression through the re module. For a great introduction to using regular expressions in python check out this chapter from the book Automate the Boring Stuff with Python. To test out regular expressions this site is great.
In this particular instance you would do something like:
import re

your_file = "test.txt"
with open(your_file, 'r') as file:
    file_contents = file.read()

number_regex = re.compile(r'Number=\d+')
number_matches = number_regex.findall(file_contents)
print(number_matches)

>>> ['Number=12']
This would return a list with all matches to the number regex. You could then do the same thing for the word match.
I am parsing log files that include lines regarding events by many jobs, identified by a job id. I am trying to get all lines in a log file between two patterns in Python.
I have read this very useful post How to select lines between two patterns? and had already solved the problem with awk like so:
awk '/pattern1/,/pattern2/' file
Since I am processing the log information in a Python script, I am using subprocess.Popen() to execute that awk command. My program works, but I would like to solve this using Python alone.
I know of the re module, but don't quite understand how to use it. The log files have already been compressed to bz2, so this is my code to open the .bz2 files and find the lines between the two patterns:
import bz2
import re

logfile = '/some/log/file.bz2'
PATTERN = r"/{0}/,/{1}/".format('pattern1', 'pattern2')
# example: PATTERN = r"/0001.server;Considering job to run/,/0040;pbs_sched;Job;0001.server/"
re.compile(PATTERN)
with bz2.BZ2File(logfile) as fh:
    match = re.findall(PATTERN, fh.read())
However, match is empty (fh.read() is not!). Using re.findall(PATTERN, fh.read(), re.MULTILINE) has no effect.
Using re.DEBUG after re.compile() shows many lines with
literal 47
literal 50
literal 48
literal 49
literal 57
and two say
any None
I could solve the problem with loops like in python print between two patterns, including lines containing patterns, but I avoid nested for-if loops as much as I can. I believe the re module can yield the result I want, but I am no expert in how to use it.
I am using Python 2.7.9.
It's usually a bad idea to read a whole log file into memory, so I'll give you a line-by-line solution. I'll assume that the dots you have in your example are the only varying part of the pattern. I'll also assume that you want to collect line groups in a list of lists.
import bz2
import re

with_delimiting_lines = True
logfile = '/some/log/file.bz2'
# note: the surrounding slashes are awk delimiters, not part of the regex
group_start_regex = re.compile(r'0001.server;Considering job to run')
group_stop_regex = re.compile(r'0040;pbs_sched;Job;0001.server')
group_list = []

with bz2.BZ2File(logfile) if logfile.endswith('.bz2') else open(logfile) as fh:
    inside_group = False
    for line_with_nl in fh:
        line = line_with_nl.rstrip()
        if inside_group:
            if group_stop_regex.match(line):
                inside_group = False
                if with_delimiting_lines:
                    group.append(line)
                group_list.append(group)
            else:
                group.append(line)
        elif group_start_regex.match(line):
            inside_group = True
            group = []
            if with_delimiting_lines:
                group.append(line)
Please note that match() matches from the beginning of the line (as if the pattern started with ^, when re.MULTILINE mode is off)
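A quick illustration of that anchoring behaviour, using made-up log strings:

```python
import re

pattern = re.compile(r'pbs_sched')

# match() only succeeds when the pattern occurs at position 0 of the string
print(pattern.match('pbs_sched;Job;0001.server'))  # a match object
print(pattern.match('04/09;pbs_sched;Job'))        # None

# search() scans the whole string instead
print(pattern.search('04/09;pbs_sched;Job'))       # a match object
```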
/pattern1/,/pattern2/ isn't a regex; it's an awk-specific construct composed of two regexes.
With a pure regex you could use pattern1.*?pattern2 with the DOTALL flag (which makes . match newlines when it usually wouldn't):
re.findall("pattern1.*?pattern2", input, re.DOTALL)
This differs from the awk command, which also matches the full lines containing the start and end patterns; that behaviour can be achieved as follows:
re.findall("[^\n]*pattern1.*?pattern2[^\n]*", input, re.DOTALL)
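A minimal, self-contained sketch of the difference (the input text here is made up):

```python
import re

text = "a pattern1 b\nmiddle line\nc pattern2 d\n"

# the non-greedy match spans newlines thanks to DOTALL
print(re.findall(r"pattern1.*?pattern2", text, re.DOTALL))
# ['pattern1 b\nmiddle line\nc pattern2']

# the [^\n]* on each end extends the match to the full delimiting lines
print(re.findall(r"[^\n]*pattern1.*?pattern2[^\n]*", text, re.DOTALL))
# ['a pattern1 b\nmiddle line\nc pattern2 d']
```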
Note that I answered your question as it was asked for the sake of pedagogy, but Walter Tross' solution should be preferred.
I am trying to complete the "Regex search" project from the book Automate the Boring Stuff with Python. I tried searching for an answer, but I failed to find a related thread for Python.
The task is: "Write a program that opens all .txt files in a folder and searches for any line that matches a user-supplied regular expression. The results should be printed to the screen."
With the compiled pattern below I manage to find the first match
regex = re.compile(r".*(%s).*" % search_str)
And I can print it out with
print(regex.search(content).group())
But if I try to use
print(regex.findall(content))
The output is only the inputted word/words, not the whole line they are on. Why won't findall match the whole line, even though that is how I compiled the regex?
My code is as follows.
# Regex search - Find user given text from a .txt file
# and prints the line it is on
import re
# user input
print("\nThis program searches for lines with your string in them\n")
search_str = input("Please write the string you are searching for: \n")
print("")
# file input
file = open("/users/viliheikkila/documents/kooditreeni/input_file.txt")
content = file.read()
file.close()
# create regex
regex = re.compile(r".*(%s).*" % search_str)
# print out the lines with match
if regex.search(content) is None:
    print("No matches was found.")
else:
    print(regex.findall(content))
In python regex, parentheses define a capturing group. (See here for breakdown and explanation).
findall will only return the captured group. If you want the entire line, you will have to iterate over the result of finditer.
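A minimal sketch of the difference, with a made-up search string and file contents:

```python
import re

content = "first line\nthe search hit is here\nlast line"
regex = re.compile(r".*(hit).*")

# findall returns only what the parentheses captured
print(regex.findall(content))                         # ['hit']

# finditer yields match objects, so group(0) gives the whole line
print([m.group(0) for m in regex.finditer(content)])  # ['the search hit is here']
```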
I am trying to read a bunch of regexes from a file, using python.
The regexes come in a file regexes.csv, a pair in each line, and the pair is separated by commas. e.g.
<\? xml([^>]*?)>,<\? XML$1>
peter,Peter
I am doing
detergent = []
infile = open('regexes.csv', 'r')
for line in infile:
    line = line.strip()
    [search_term, replace_term] = line.split(',', 1)
    detergent += [[search_term, replace_term]]
This is not producing the right result. If I print detergent I get
[['<\\? xml([^>]*?)>', '<\\? XML$1>'], ['peter', 'Peter']]
It seems to be that it is escaping the backslashes.
Moreover, in a file containing, say
<? xml ........>
a command re.sub(search_term, replace_term, file_content) further down in the script replaces it with
<\? XML$1>
So, the $1 is not recovering the first capture group in the first regex of the pair.
What is the proper way to input regexes from a file to be later used in re.sub?
When I've had the regexes inside the script I would write them inside r'...', but I am not sure what the issues at hand are when reading from a file.
There are no issues or special requirements for reading regexes from a file. The escaping of backslashes is simply how Python represents a string containing them. For example, suppose you had defined a regex as rgx = r"\?" directly in your code. Try evaluating it in the interpreter; you'll see it is displayed the same way ...
>>> r"\?"
'\\?'
The reason your $1 is not being replaced is that this is not the syntax for group references. The correct syntax is \1.
I would like to be able to search a dictionary in Python using user input wildcards.
I have found this:
import fnmatch
lst = ['this','is','just','a','test', 'thing']
filtered = fnmatch.filter(lst, 'th*')
This matches this and thing. Now if I try to input a whole file and search through
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower()

filtered = fnmatch.filter(file_contents, 'th*')
this doesn't match anything. The difference is that the file I am reading from is a text file (a Shakespeare play), so it has spaces and is not a list. I can match things such as a single letter: if I search for just 't' I get a bunch of t's. This tells me that I am matching single letters; however, I want to match whole words, and moreover to preserve the wildcard structure.
What I would like to happen is that a user enters text (including what will be a wildcard) that I can substitute into the place where 'th*' is, with the wildcard still working as it should. That leads to the question: can I just stick a variable holding the search text in place of 'th*'? After some investigation I am wondering if I am somehow supposed to translate the 'th*', and have found something such as:
regex = fnmatch.translate('th*')
print(regex)
which outputs th.*\Z(?ms)
Is this the right way to go about doing this? I don't know if it is needed.
What would be the best way to go about "passing in regex formulas"? I would also appreciate an idea of what I have wrong in the code, as it does not operate on the incoming text in the second snippet the way it (correctly) does in the first.
If the problem is just that you "have spaces and it is not a list," why not make it into a list?
with open('testfilefolder/wssnt10.txt') as f:
    file_contents = f.read().lower().split(' ')  # split on spaces to make a list

filtered = fnmatch.filter(file_contents, 'th*')
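For example, with a stand-in string instead of the play text (the pattern could just as well come from input()):

```python
import fnmatch

# stand-in for f.read().lower()
file_contents = "This thing is just a test".lower().split(' ')

search = 'th*'  # could be user-supplied
print(fnmatch.filter(file_contents, search))  # ['this', 'thing']
```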
Currently, I am using a regular expression to search for a pattern of numbers in a log file. I also want to add another search capability: a general user-submitted ASCII string search that prints out the line number. This is what I have and am trying to work out (help is appreciated):
logfile = open("13.00.log", "r")
searchString = raw_input("Enter search string: ")

for line in logfile:
    search_string = searchString.findall(line)
    for word in search_string:
        print word  # ideally would like to create and write to a text file
First of all, strings don't have a findall method -- I don't know where you got that. Second, why use a string method or regex at all? For a simple string search of the kind you're describing, in is sufficient, as in if search_string in line:. To get line numbers, a quick solution is the enumerate built-in function: for line_number, line in enumerate(logfile):.
Your code seems fairly fragmented. Pseudocode would look something like

get search_string
for line_no, line in enumerate(logfile):
    if search_string in line:
        do output with line_no
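A runnable version of that pseudocode, using a list of made-up strings to stand in for the open log file:

```python
# hypothetical log lines standing in for the file object
logfile = ["first line\n", "error: disk full\n", "last line\n"]
search_string = "error"

matches = []
for line_number, line in enumerate(logfile, start=1):
    if search_string in line:
        matches.append((line_number, line.rstrip()))

print(matches)  # [(2, 'error: disk full')]
```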