For my work I'm used to working with MATLAB. Now I'm trying to learn the basic skills of Python as well. Currently I'm working on the following exercise:
You are interested in extracting all of the occurrences that look like
this
<Aug22-2008> <15:37:37> Bond Energy LDA -17.23014168 eV
In particular, you want to gather the numerical values (e.g.,
-17.23014168), and print them out. Write a script that reads the output file from standard input, and uses regular expressions to
locate the values you want to extract. Have your script print out all
the values to standard output.
This is the code I use:
import os,re
from string import rjust
dataEx=re.compile(r'''
^\s*
<Aug22-2008>
\s+
<\d{2}:\d{2}:\d{2}>
\s+
Bond
\s
Energy
\s
LDA
\s+
((\+|-)?(\d*)\.?\d*)
''',re.VERBOSE)
f=open('Datafile_Q2.txt','r')
line = f.readline()
while line != '':
    line = f.readline() # Get next line
    m = dataEx.match(line)
    if m:
        # print line
        print m.group(1)
With this code I'm able to find all the values in the data file they ask for. However, I do have a few questions. Firstly, they ask specifically about stdin and stdout. Now I'm wondering: do I use the right code to read the output file from standard input, and do I really print out all the values to standard output in this way? Furthermore, I'm wondering whether there is a better or easier way to find the required values?
To find the numbers you're looking for, I would use a positive lookbehind and lookahead in your regular expression.
(?<=Bond Energy LDA ).*(?= eV)
This checks that the thing you are looking at is preceded by 'Bond Energy LDA' and followed by ' eV', but does not include them in the string you extract. So, assuming the numbers you are looking for are always preceded and followed by those two things, you can find them like that.
A nice way to read from stdin is to use Python's sys module.
import sys
Then you can read lines straight from stdin:
import sys
import re
for line in sys.stdin:
    matchObj = re.search(r'(?<=Bond Energy LDA ).*(?= eV)', line, re.I)
    if matchObj:
        print(matchObj.group())
If the regular expression is not found on the line, matchObj will be None and the if statement is skipped. If it is found, search() returns a match object containing the groups. You can then print the group to stdout, since print writes to stdout by default when no file is given.
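Putting it together with the verbose pattern from the question, a rough sketch of a stdin-to-stdout version (assuming the same data layout as the sample line) might look like:
import sys
import re
dataEx = re.compile(r'''
    ^\s*
    <Aug22-2008>
    \s+
    <\d{2}:\d{2}:\d{2}>
    \s+
    Bond\sEnergy\sLDA
    \s+
    ((\+|-)?(\d*)\.?\d*)
    ''', re.VERBOSE)
for line in sys.stdin:
    m = dataEx.match(line)
    if m:
        print(m.group(1))
You would then feed the data file to the script with shell redirection, e.g. python yourscript.py < Datafile_Q2.txt (the script name is just a placeholder).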
Why use a regular expression? Split the input:
>>> s = """<Aug22-2008> <15:37:37> Bond Energy LDA -17.23014168 eV"""
>>> s.split()[5]
'-17.23014168'
Of course, if you can provide more sample input where the number is not the sixth whitespace-separated field (index 5), this perhaps is not enough.
Ask your teacher for more sample input.
STDIN and STDOUT are documented.
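A minimal sketch of the split-based approach reading from standard input (the guard and the field index 5 are assumptions based on the sample line above):
import sys
for line in sys.stdin:
    fields = line.split()
    # only handle lines shaped like the sample: <date> <time> Bond Energy LDA <value> eV
    if len(fields) >= 7 and fields[2:5] == ['Bond', 'Energy', 'LDA']:
        print(fields[5])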
If you want to use regex you may use:
(?:<.*>\W+)[a-zA-Z ]+([-+]?[0-9]*\.?[0-9]+)
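A quick sketch of how that pattern could be applied line by line from standard input (group 1 captures the number):
import sys
import re
pattern = re.compile(r'(?:<.*>\W+)[a-zA-Z ]+([-+]?[0-9]*\.?[0-9]+)')
for line in sys.stdin:
    m = pattern.search(line)
    if m:
        print(m.group(1))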
I am new to regex, so please explain how you got to the answer. Anyway, I want to know the best way to match the input() function from a separate Python file.
For example:
match.py
a = input("Enter a number")
b = input()
print(a+b)
Now I want to match ONLY the input statement and replace it with a random number. I will do this in a separate file, main.py. So my aim is to replace the input function in match.py with random numbers so I can check whether the output comes out as expected. You can think of match.py as a coding exercise where the user writes the code, and main.py as the file that evaluates whether the user's code is right. To do that I need to replace the input myself and check if it works for all kinds of inputs. I looked for "regex patterns for python input function" but the search did not turn up what I needed. I have a current way of doing it, but I don't think it works in all cases. I need a pattern which works for all cases allowed by Python syntax. Here is the current main.py I have (it doesn't work for all cases; when you write a string with single quotes it does not replace it, and while I could just add a single quote to the pattern, I also need to detect whether both kinds of quotes are used):
# Evaluating python file checking if input 2 numbers and print sum is correct
import re
import subprocess
input_pattern = re.compile(r"input\s?\([\"]?[\w]*[\"]?\)")
file = open("match.py", 'r')
read = file.read()
file.close()
code = read
matches = input_pattern.findall(code)
for match in matches:
    code = code.replace(match, '8')
file = open("match.py", 'w')
file.write(code)
file.close()
process = subprocess.Popen('python3 match.py', stdout=subprocess.PIPE, stderr=subprocess.PIPE, shell=True)
out = process.communicate()[0]
print(out == b"16\n")
file = open("match.py", 'w')
file.write(read)
file.close()
Please let me know if you don't understand this question.
The following regex statement is very close to what you need:
input\s?\((?(?=[\"\'])[\"\'].*[\"\']\)|\))
I am using a conditional regex statement. However, I think it may need a nested conditional to avoid the situation that the user enters something like:
input(' text ")
But hopefully this gets you on the right track.
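Note that, as far as I know, Python's built-in re module only supports conditions that refer to a group number or name, not lookahead conditions, so the pattern above may need the third-party regex module. A rough alternative sketch using only the standard library is to capture the opening quote and require the same quote again with a backreference, so matched single quotes, matched double quotes, and no quotes are all accepted while mismatched quotes are rejected:
import re
# \1 forces the closing quote to equal the opening one (or both to be absent);
# [^'"]* stands in for the prompt text, so adjust it if prompts may contain quotes
input_pattern = re.compile(r"""input\s?\((['"]?)[^'"]*\1\)""")
code = 'a = input("Enter a number")\nb = input()\nprint(a+b)\n'
print(input_pattern.sub('8', code))
# a = 8
# b = 8
# print(a+b)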
This question already has answers here:
Python: consecutive lines between matches similar to awk
I am parsing log files that include lines regarding events by many jobs, identified by a job id. I am trying to get all lines in a log file between two patterns in Python.
I have read this very useful post How to select lines between two patterns? and had already solved the problem with awk like so:
awk '/pattern1/,/pattern2/' file
Since I am processing the log information in a Python script, I am using subprocess.Popen() to execute that awk command. My program works, but I would like to solve this using Python alone.
I know of the re module, but don't quite understand how to use it. The log files have already been compressed to bz2, so this is my code to open the .bz2 files and find the lines between the two patterns:
import bz2
import re
logfile = '/some/log/file.bz2'
PATTERN = r"/{0}/,/{1}/".format('pattern1', 'pattern2')
# example: PATTERN = r"/0001.server;Considering job to run/,/0040;pbs_sched;Job;0001.server/"
re.compile(PATTERN)
with bz2.BZ2File(logfile) as fh:
    match = re.findall(PATTERN, fh.read())
However, match is empty (fh.read() is not!). Using re.findall(PATTERN, fh.read(), re.MULTILINE) has no effect.
Using re.DEBUG after re.compile() shows many lines with
literal 47
literal 50
literal 48
literal 49
literal 57
and two say
any None
I could solve the problem with loops like here: python print between two patterns, including lines containing patterns, but I avoid nested for/if loops as much as I can. I believe the re module can yield the result I want, but I am no expert in how to use it.
I am using Python 2.7.9.
It's usually a bad idea to read a whole log file into memory, so I'll give you a line-by-line solution. I'll assume that the dots you have in your example are the only varying part of the pattern. I'll also assume that you want to collect line groups in a list of lists.
import bz2
import re
with_delimiting_lines = True
logfile = '/some/log/file.bz2'
group_start_regex = re.compile(r'/0001.server;Considering job to run/')
group_stop_regex = re.compile(r'/0040;pbs_sched;Job;0001.server/')
group_list = []
with bz2.BZ2File(logfile) if logfile.endswith('.bz2') else open(logfile) as fh:
    inside_group = False
    for line_with_nl in fh:
        line = line_with_nl.rstrip()
        if inside_group:
            if group_stop_regex.match(line):
                inside_group = False
                if with_delimiting_lines:
                    group.append(line)
                group_list.append(group)
            else:
                group.append(line)
        elif group_start_regex.match(line):
            inside_group = True
            group = []
            if with_delimiting_lines:
                group.append(line)
Please note that match() matches from the beginning of the line (as if the pattern started with ^, when re.MULTILINE mode is off)
/pattern1/,/pattern2/ isn't a regex, it's an awk-specific construct composed of two regexes.
With pure regex you could use pattern1.*?pattern2 with the DOTALL flag (which makes . match newlines when it usually wouldn't):
re.findall("pattern1.*?pattern2", input, re.DOTALL)
It differs from the awk command, which matches the full lines containing the start and end patterns; this could be achieved as follows:
re.findall("[^\n]*pattern1.*?pattern2[^\n]*", input, re.DOTALL)
Try it here!
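For illustration, here is a small self-contained run of both calls on made-up input (the sample text is invented):
import re
sample = """before
line with pattern1 here
middle line
line with pattern2 here
after"""
# non-greedy match from the first pattern to the second, spanning newlines
print(re.findall("pattern1.*?pattern2", sample, re.DOTALL))
# ['pattern1 here\nmiddle line\nline with pattern2']
# extended to the full lines containing the start and end patterns, like awk
print(re.findall("[^\n]*pattern1.*?pattern2[^\n]*", sample, re.DOTALL))
# ['line with pattern1 here\nmiddle line\nline with pattern2 here']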
Note that I answered your question as it was asked for the sake of pedagogy, but Walter Tross' solution should be preferred.
So I know the set points <start point> and <end point> in the text file, and I need to use these to find certain information between them, which will then be used and printed. I currently have .readlines() within a different function, which is used within the new function to find the information.
You can try something like this:
flag = False
info = [] # your desired information will be appended as a string in list
with open(your_file, 'r') as file:
    for line in file.readlines():
        if '<start point>' in line: # Pointer reached the start point
            flag = True
        if '<end point>' in line: # Pointer reached the end point
            flag = False
        if flag: # this line is between the start point and endpoint
            info.append(line)
>>> info
['Number=12', 'Word=Hello']
This seems like a job for regular expressions. If you have not yet encountered regular expressions, they are an extremely powerful tool that can basically be used to search for a specific pattern in a text string.
For example the regular expression (or regex for short) Number=\d+ would find any line in the text document that has Number= followed by any number of number characters. The regex Word=\w+ would match any string starting with Word= and then followed by any number of letters.
In python you can use regular expression through the re module. For a great introduction to using regular expressions in python check out this chapter from the book Automate the Boring Stuff with Python. To test out regular expressions this site is great.
In this particular instance you would do something like:
import re
your_file = "test.txt"
with open(your_file,'r') as file:
    file_contents = file.read()
number_regex = re.compile(r'Number=\d+')
number_matches = re.findall(number_regex, file_contents)
print(number_matches)
>>> ['Number=12']
This would return a list with all matches to the number regex. You could then do the same thing for the word match.
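In the same way, a sketch for the word entries (assuming a line like Word=Hello exists in test.txt):
word_regex = re.compile(r'Word=\w+')
word_matches = re.findall(word_regex, file_contents)
print(word_matches)
>>> ['Word=Hello']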
All,
I am relatively new to Python but have used other scripting languages with regex extensively. I need a script that will open a file, look for a regex pattern, replace the pattern, and close the file. I have found that the script below works great; however, I don't know if the "for line in fileinput.input" command can accommodate a regex pattern that spans more than a single line (i.e. the regex includes a carriage return). In my case, it covers 2 lines. My test file read_it.txt looks like this
read_it.txt (contains just 3 lines)
ABA
CDC
EFE
The script is designed to open the file, recognize the pattern ABA\nCDC that is seen over 2 lines, then replace it with the word TEST.
If the pattern replacement is successful, the file should then read as follows and contain only 2 lines:
TEST
EFE
Knowing the answer to this will help greatly in using Python scripts to parse text files and modify them on the fly. I believe, but am not sure, that there may be a better Python construct that still allows for regex searches. So the questions are:
1) Do I need to change something in the existing script that would change the behavior of the "for line" command to match a multi-line regex pattern?
2) Or do I need a different Python script that is better suited to a multi-line search?
Some things that may help, but which I currently don't know how to write, are:
1) the fileinput "readline" option.
2) adding (?m) in the expression for multiline
Please help!
Brent
SCRIPT
import sys
import fileinput
import re
for line in fileinput.input('C:\\Python34\\read_it.txt', inplace=1):
    line = re.sub(r'A(B)A$\nCDC', r'TEST', line.rstrip())
    print(line)
2) adding (?m) in the expression for multiline
You can do this by passing flags=re.MULTILINE (or flags=re.M) as an argument to re.sub, or by embedding (?m) at the start of the pattern. Note that the flags must be passed by keyword, because the fourth positional argument of re.sub is count, not flags.
Example:
re.sub(r'(?m)A(B)A$\nCDC', r'TEST', line.rstrip())
or
re.sub(r'A(B)A$\nCDC', r'TEST', line.rstrip(), flags=re.MULTILINE)
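Keep in mind that fileinput hands the script one line at a time, so a pattern containing \n can never match within a single line regardless of flags. A minimal sketch of a whole-file read/substitute/write-back approach (reusing the path and pattern from the question) might look like:
import re
path = 'C:\\Python34\\read_it.txt'
with open(path) as f:
    text = f.read()
# the pattern can now span the newline between ABA and CDC
text = re.sub(r'A(B)A$\nCDC', 'TEST', text, flags=re.MULTILINE)
with open(path, 'w') as f:
    f.write(text)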
I am looking for a way to match the beginning of a line against a regex and then have the line returned afterwards. The set is quite extensive, which is why I cannot simply use the method given in Python regular expressions matching within set. I was also wondering if regex is the best solution. I have read http://docs.python.org/3.3/library/re.html but, alas, it does not seem to hold the answer. Here is what I have tried so far...
import re
import os
import itertools
f2 = open(file_path)
unilist = []
bases=['A','G','C','N','U']
patterns= set(''.join(per) for per in itertools.product(bases, repeat=5))
#stuff
if re.match(r'.*?(?:patterns)', line):
    print(line)
    unilist.append(next(f2).strip())
print (unilist)
You see, the problem is that I do not know how to refer to my set...
The file I am trying to match it to looks like:
#SRR566546.970 HWUSI-EAS1673_11067_FC7070M:4:1:2299:1109 length=50 TTGCCTGCCTATCATTTTAGTGCCTGTGAGGTGGAGATGTGAGGATCAGT
+
hhhhhhhhhhghhghhhhhfhhhhhfffffeee[X]b[d[ed`[Y[^Y
You are going about it the wrong way.
You simply leave the set of characters to the regular expression:
re.search('[AGCNU]{5}', line)
matches any 5-character pattern built from those 5 characters; that matches the same 3125 different combinations you generated with your set line, but doesn't need to build all possible combinations up front.
Otherwise, your regular expression attempt had no correlation to your patterns variable: the pattern r'.*?(?:patterns)' would match 0 or more arbitrary characters followed by the literal text 'patterns'.
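A rough sketch of how the loop from the question might use this (file_path and the append-the-following-line behaviour are taken from the question, so adjust them to your actual data layout):
import re
pattern = re.compile(r'[AGCNU]{5}')
unilist = []
with open(file_path) as f2:  # file_path as defined in the question
    for line in f2:
        if pattern.search(line):
            print(line.rstrip())
            unilist.append(next(f2, '').strip())  # grab the following line, '' at end of file
print(unilist)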
According to what I've understood from your question, it seems to me that this could fit your need:
import re
sss = '''dfgsdfAUGNA321354354
!=**$=)"nNNUUG54788
=AkjhhUUNGffdffAAGjhff1245GGAUjkjdUU
.....cv GAUNAANNUGGA'''
print(re.findall('^(.+?[AGCNU]{5})', sss, re.MULTILINE))