I'm trying to read xyz coordinates from a long file using Python.
Within the file there is a block whose header indicates that the xyz coordinates follow on the next lines:
CARTESIAN COORDINATES (ANGSTROEM)
---------------------------------
C -0.283576 -0.776740 -0.312605
H -0.177080 -0.046256 -1.140653
Cl -0.166557 0.025928 1.189976
----------------------------
I'm using the following code to find the line which mentions "CARTESIAN COORDINATES (ANGSTROEM)" and then try to iterate until an empty line is found, reading the coordinates along the way. However, f.tell() reports that I'm still at position 0! Therefore I cannot use next(f) or f.readline() to go through the next lines (it just goes to line 1 from line 0). I don't know how this can be done in Python.
def read_xyz_out(self, out):
    atoms = []
    x = []
    y = []
    z = []
    f = open(out, "r")
    for line in open(out):
        if re.match(r'{}'.format(r'CARTESIAN COORDINATES \(ANGSTROEM\)'), line):
            print(f.tell())
            # data = line.split()
            # atoms.append(data[0])
            # x.append(float(data[1]))
            # y.append(float(data[2]))
            # z.append(float(data[3]))
Suppose you read your file into this string:
My dog has fleas.
CARTESIAN COORDINATES (ANGSTROEM)
---------------------------------
C -0.283576 -0.776740 -0.312605
H -0.177080 -0.046256 -1.140653
Cl -0.166557 0.025928 1.189976
----------------------------
My cat too.
You can then extract lines 4, 5 and 6 with the regular expression
/CARTESIAN COORDINATES \(ANGSTROEM\)\r?\n---------------------------------\r?\n(.+?)(?=\r?\n\r?\n)/s
This expression reads, "match the string 'CARTESIAN ... ---' followed by a line break, then capture one or more characters, lazily, in capture group 1, up to (but not including) an empty line, with the flag '/s' so that '.' also matches newline characters".
The desired information can then be extracted with the regular expression
/ *([A-Z][a-z]*) +(-?\d+\.\d{6}) +(-?\d+\.\d{6}) +(-?\d+\.\d{6})\r?\n/
The first step can be skipped if it is sufficient to look for a line that looks like this:
C -0.283576 -0.776740 -0.312605
without having to confirm it is preceded by "CARTESIAN...---".
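In Python, those two expressions could be combined roughly as follows (a sketch, not the original answer's code: the file name orca.out is a placeholder, the run of dashes is generalised to -+, and the block is assumed to end at a blank line):

import re

with open("orca.out") as fh:
    text = fh.read()

# capture everything between the header/underline and the next blank line
block = re.search(
    r'CARTESIAN COORDINATES \(ANGSTROEM\)\r?\n-+\r?\n(.+?)(?=\r?\n\r?\n)',
    text, re.S)

atoms, x, y, z = [], [], [], []
if block:
    # pull the element symbol and the three coordinates out of each line
    for m in re.finditer(r' *([A-Z][a-z]*) +(-?\d+\.\d+) +(-?\d+\.\d+) +(-?\d+\.\d+)',
                         block.group(1)):
        atoms.append(m.group(1))
        x.append(float(m.group(2)))
        y.append(float(m.group(3)))
        z.append(float(m.group(4)))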
You've opened out twice: once for the f variable and a second time for the for line in open(out): loop. Each file object has its own position, and you've only been reading from the second one (which hasn't been assigned to a variable so you can't get the position). The position of f is still at the beginning, since you never read from it.
You should use
for line in f:
and not call open(out) a second time. You can then call f.readline() inside the loop to read more lines of the file.
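To see that the two file objects really are independent, here is a tiny sketch (the file name is just a placeholder):

f = open("out.txt", "r")
for line in open("out.txt"):   # a second, anonymous file object
    pass                       # only the second object is ever read
print(f.tell())                # still 0: f itself has not been read from
f.close()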
How about this (note: untested so there's bound to be bugs - think of this as a sketch of a solution):
def read_xyz_out(self, out):
    atoms = []
    x = []
    y = []
    z = []
    f = open(out, "r")
    # Read until you get to the data
    for line in f:
        if re.match(r'CARTESIAN COORDINATES \(ANGSTROEM\)', line):
            # skip the underline of dashes too
            f.readline()
            break
    # Now you're into the data - the loop here picks up where the previous
    # one left off
    for line in f:
        data = line.split()
        if len(data) != 4:
            # blank line or closing dashes: end of the coordinate block
            break
        atoms.append(data[0])
        x.append(float(data[1]))
        y.append(float(data[2]))
        z.append(float(data[3]))
    f.close()
    return atoms, x, y, z
I have a text file that contains xyz coordinates broken up by lines of text (specifically, the first 2 lines are text, then the next 22 are coordinates, then the next 2 are text, and so on for the rest of the file). I want to read the file in so that I end up with a numpy array (or list, either works) that contains all the different sets of coordinates in separate lists/arrays.
So:
[[x1 y1 z1],[x2 y2 z2],...]
Here is what I have tried:
def convert_xyz_bat(filename, newfile): #add self later
    with open(filename, "r") as f:
        coords = []
        for line in f:
            if "C" in line or "H" in line:
                atom, x, y, z = line.split(" ")
                coords.append([float(x), float(y), float(z)])
            else:
                pass
        coordinates = np.array(coords, dtype=object)
    return print(coordinates[0])
This takes up a lot of memory since it stores all the coordinates in one variable (the file is really large). I'm not sure if the following will use less memory or not, but I could also do something like this, where I make another file which contains all the coordinates:
with open(filename, "r") as f:
    with open(newfile, "r+") as f1:
        for line in f:
            if "C" in line or "H" in line:
                atom, x, y, z = line.split(" ")
                f1.write(str([float(x), float(y), float(z)]))
            else:
                pass
return
If I make the file, the problem is that it only lets me write the coordinates as strings, so I would later have to open that file again and read them back in as an array (so that I can use indexing in loops).
I am not sure which option would work better, or if there is a better third/fourth option that I have not considered.
You have some typos in your first code: return print() is a weird combination, and there is an indentation problem near the with statement.
As mentioned, your second option will consume less memory; however, the data will then only be reachable on demand from the new file.
I think you need to rethink what your main target is. If you just want to convert the data between different formats, from file to file, the second option is better. If you need to apply some logic to the data, the first option (with its high memory consumption) is the solution. You can also do something else: instead of reading all the data at once, read it in chunks and work your way through the file. Something like:
class ReadFile:
    def __init__(self, file_path):
        self.file_pipe = open(file_path, "r")
        self.number_of_lines_to_read = 1000

    def __del__(self):
        self.file_pipe.close()

    def get_next_coordinates(self):
        cnt = 0
        coords = []
        for line in self.file_pipe:
            cnt += 1
            if cnt % self.number_of_lines_to_read == 0:
                # hand back the chunk collected so far and start a new one
                yield np.array(coords, dtype=object)
                coords = []
            if "C" in line or "H" in line:
                atom, x, y, z = line.split(" ")
                coords.append([float(x), float(y), float(z)])
            else:
                pass
        # yield whatever is left after the last full chunk
        yield np.array(coords, dtype=object)
and then you can use it as follows:
read_file = ReadFile(file_path)
for coords in read_file.get_next_coordinates():
    # do something with the coords
    pass
I have a log file which shows data sent in the below format -
2019-10-17T00:00:02|Connection(10.0.0.89 :0 ) r=0 s=1024 d=0 t=0 q=0 # connected
2019-10-17T00:00:02|McSend (229.0.0.70 :20001) b=1635807 f=2104 d=0 t=0
There will be multiple lines per file.
How can I graph the b= value against the time (near the beginning of the line), but only from the McSend lines?
Thanks
If you're not familiar with regular expressions, the Python regex documentation is a good place to start.
The simplest regex you probably need is r"^(\d\d\d\d-\d\d-\d\dT\d\d:\d\d:\d\d)\|.+McSend.+b=(\d+)"
The first group will give you the timestamp and the second will give the b value.
import re

pattern = r"^(\d\d\d\d-\d\d-\d\dT\d\d:\d\d:\d\d)\|.+McSend.+b=(\d+)"

# result is a list of tuples containing the time stamp and the value for b
# (re.MULTILINE lets ^ match at the start of every line in the input)
result = re.findall(pattern, some_input, re.MULTILINE)
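For completeness, a small sketch of feeding a whole log file through that pattern (the file name mcsend.log is an assumed example):

import re

pattern = r"^(\d\d\d\d-\d\d-\d\dT\d\d:\d\d:\d\d)\|.+McSend.+b=(\d+)"

with open("mcsend.log") as fh:
    # each entry is a (timestamp, b_value) tuple
    pairs = re.findall(pattern, fh.read(), re.MULTILINE)

times = [t for t, b in pairs]
b_values = [int(b) for t, b in pairs]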
You should read your file line by line. Then check whether each line contains 'McSend'; if it does, retrieve the desired data.
You could do something like this :
b_values = []
dates = []

## Let's open the file and read it line by line
with open(filepath) as f:
    for line in f:
        ## If the line contains McSend
        if 'McSend' in line:
            ## We split the line by spaces ( split() with no arguments does so )
            splited_line = line.split()
            ## First string chunk contains the header where the date is located
            header = splited_line[0]
            ## Then retrieve the b value
            for val in splited_line:
                if val.startswith('b='):
                    b_value = val.split("=", 1)[1]
            ## Now you can add the value to arrays and then plot what you need
            b_values.append(b_value)
            dates.append(header.split("|", 1)[0])

## Do your plot
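To actually graph those values, a minimal sketch using matplotlib (assuming matplotlib is installed; the timestamp format is inferred from the sample lines, and dates/b_values are the lists filled above):

from datetime import datetime
import matplotlib.pyplot as plt

# convert the collected strings into types matplotlib can plot directly
times = [datetime.strptime(d, "%Y-%m-%dT%H:%M:%S") for d in dates]
values = [int(b) for b in b_values]

plt.plot(times, values)
plt.xlabel("time")
plt.ylabel("b")
plt.show()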
Please see the attached image showing the format of the text file. I need to extract the dimensions of the data matrix indicated by the first line in the file, here 49 * 70 * 1 for the case shown in the image. Note that the length of the name "gd_fac" can vary. How can I extract these numbers as integers? I am using Python 3.6.
The specification is not very clear. I am assuming that the information you want will always be in the first line, and always in parentheses. Given that assumption:
with open(filename) as infile:
    line = infile.readline()
    string = line[line.find('(')+1:line.find(')')]
    lst = [int(dim) for dim in string.split('x')]
This will create the list lst = [49, 70, 1].
What is happening here:
First I open the file (you will need to replace filename with the name of your file, as a string). The with ... as ... structure ensures that the file is closed after use. Then I read the first line. After that, I select only the part of that line that falls after the open paren ( and before the close paren ). Finally, I break the string into parts, with the character x as the separator, and convert each part to an integer. This creates a list of the values in the first line of the file that fall between the parentheses and are separated by x.
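As a quick check, assuming the first line of the file looks something like gd_fac(49x70x1) (the exact layout is only visible in the attached image, so treat this line as an assumption):

line = "gd_fac(49x70x1)"                        # hypothetical first line
string = line[line.find('(')+1:line.find(')')]  # "49x70x1"
lst = [int(dim) for dim in string.split('x')]   # [49, 70, 1]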
Since you have mentioned that the length of 'gd_fac' can vary, the best solution is to use a regular expression.
import re

with open("a.txt") as fh:
    for line in fh:
        if '(' in line and ')' in line:
            dimension = re.findall(r'.*\((.*)\)', line)[0]
            break

print(dimension)
Output:
'49x70x1'
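Since the question asks for integers, the captured string can be converted with one more step, e.g.:

dims = [int(n) for n in dimension.split('x')]   # [49, 70, 1]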
What this does is look for "gd_fac"; if it is there, it removes all the unneeded parts of the line and leaves just what you want.
with open('test.txt', 'r') as infile:
    for line in infile:
        if "gd_fac" in line:
            line = line.replace("gd_fac", "")
            line = line.replace("x", "*")
            line = line.replace("(", "")
            line = line.replace(")", "")
            print(line)
            break
OUTPUT: "49*70*1"
I have a file, like this:
<prop type="ltattr-match">1-1</prop>
id =>3</prop>
<tuv xml:lang="en">
<seg> He is not a good man </seg>
And what I want is to detect the third line before the line "He is not a good man", i.e. (id =>3). The file is big. What can I do?
I suggest using a double ended queue with a maximum length: this way, only the required amount of "backlog" is stored and you don't have to fiddle around with slices manually. We don't need the "double-ended-ness", but the normal Queue class blocks if the queue is full.
import collections

dq = collections.deque([], 3)  # create an empty queue that holds at most 3 lines

with open("mybigfile.txt") as file:
    for line in file:
        if line.startswith('<seg>'):
            print(dq[0])   # the line 3 back; or add it to a list
            break
        dq.append(line)    # save the line; if 3 lines are already stored,
                           # the oldest one is discarded
with open("mybigfile.txt") as file:
lines = file.readlines()
for idx, line in enumerate(lines):
if line.startswith("<seg>"):
line_to_detect = lines[idx-3]
#use idx-2 if you want the _second_ line before this one,
#ex `id =>3</prop>`
print "This line was detected:"
print line_to_detect
Result:
This line was detected:
<prop type="ltattr-match">1-1</prop>
As we previously discussed in chat, this method can be memory intensive for very large files. But 100 pages isn't very large, so this should be fine.
Read each line in sequence, remembering only the last 3 read at any point.
Something like:
# Assume f is a file object open to your file
last3 = []
last3.append(f.readline())
last3.append(f.readline())
last3.append(f.readline())
while True:
    line = f.readline()
    if not line:
        break                       # end of file reached without a match
    if line.startswith('<seg>'):    # the condition you are matching on
        break
    last3 = last3[1:] + [line]
# At this point last3[0] is 3 lines before the matching line
You'll need to modify this to handle files with fewer than 3 lines, or to signal when no line matches your condition.
file = "path/to/the/file"
f = open(file, "r")
lines = f.readlines()
f.close()
i = 0
for line in lines:
if "<seg> He is not a good man </seg>" in line:
print(lines[i]) #Print the prvious line
else
i += 1
If you need the second line before, just change it to print(lines[i-2]).
I want to set the current position in a textfile one line back.
Example:
I search in a textfile for a word "x".
Textfile:
Line: qwe qwe
Line: x
Line: qwer
Line: qwefgdg
If I find that word, the current position of the file object shall be set back one line.
(In the example I find the word in the 2nd line, so the position shall be set to the beginning of the 1st line.)
I tried to use seek, but I wasn't successful.
This is not how you do it in Python. You should just iterate over the file, test the current line and never worry about file pointers. If you need to retrieve the content of the previous line, just store it.
>>> with open('text.txt') as f: print(f.read())
a
b
c
d
e
f
>>> needle = 'c\n'
>>> with open('test.txt') as myfile:
previous = None
position = 0
for line in myfile:
if line == needle:
print("Previous line is {}".format(repr(previous)))
break
position += len(line) if line else 0
previous = line
Previous line is 'b\n'
>>> position
4
If you really need the byte position of the previous line, be aware that the tell/seek methods don't blend well with iteration, so reopen the file to be safe.
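A sketch of that last point, reusing the position and previous variables from the loop above (this assumes plain ASCII text with \n newlines, so that the character counts summed into position match byte offsets):

with open('test.txt') as myfile:
    # step back from the start of the matching line by the length of the
    # stored previous line to reach the start of the line before it
    myfile.seek(position - len(previous))
    print(myfile.readline())   # prints 'b\n' in the example above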
f = open('filename').readlines()
i = 0
while True:
    if i > len(f) - 1:
        break
    if 'x' in f[i]:
        i = i - 1
        print(f[i])
    i += 1
Be careful, as that will create a forever loop. Make sure you add an exit condition for the loop to terminate.