Selecting line from file by using "startswith" and "next" commands - python

I have a file from which I want to create a list ("timestep") from the numbers which appear after each line "ITEM: TIMESTEP" so:
timestep = [253400, 253500, .. etc]
Here is the sample of the file I have:
ITEM: TIMESTEP
253400
ITEM: NUMBER OF ATOMS
378
ITEM: BOX BOUNDS pp pp pp
-2.6943709180241954e-01 5.6240920636804063e+01
-2.8194230631882372e-01 5.8851195163321044e+01
-2.7398090193568775e-01 5.7189372326936599e+01
ITEM: ATOMS id type q x y z
16865 3 0 28.8028 1.81293 26.876
16866 2 0 27.6753 2.22199 27.8362
16867 2 0 26.8715 1.04115 28.4178
16868 2 0 25.7503 1.42602 29.4002
16869 2 0 24.8716 0.25569 29.8897
16870 3 0 23.7129 0.593415 30.8357
16871 3 0 11.9253 -0.270359 31.7252
ITEM: TIMESTEP
253500
ITEM: NUMBER OF ATOMS
378
ITEM: BOX BOUNDS pp pp pp
-2.6943709180241954e-01 5.6240920636804063e+01
-2.8194230631882372e-01 5.8851195163321044e+01
-2.7398090193568775e-01 5.7189372326936599e+01
ITEM: ATOMS id type q x y z
16865 3 0 28.8028 1.81293 26.876
16866 2 0 27.6753 2.22199 27.8362
16867 2 0 26.8715 1.04115 28.4178
16868 2 0 25.7503 1.42602 29.4002
16869 2 0 24.8716 0.25569 29.8897
16870 3 0 23.7129 0.593415 30.8357
16871 3 0 11.9253 -0.270359 31.7252
To do this I tried to use "startswith" and "next" together, but it didn't work. Is there another way to do it? Here is the code I'm trying to use:
timestep = []
with open(file, 'r') as f:
    lines = f.readlines()
    for line in lines:
        line = line.split()
        if line[0].startswith("ITEM: TIMESTEP"):
            timestep.append(next(line))
print(timestep)

The idea is to decide, line by line, whether the current line should be appended to timestep. For that you need a flag that is True only while you are inside an "ITEM: TIMESTEP" block.
timestep = []
append_to_list = False  # decision variable
with open(file, 'r') as f:
    lines = f.readlines()
    for line in lines:
        line = line.strip()  # remove "\n" from line
        if line.startswith("ITEM"):
            # Update append_to_list
            if line == 'ITEM: TIMESTEP':
                append_to_list = True
            else:
                append_to_list = False
        else:
            # append to list if line doesn't start with "ITEM" and append_to_list is True
            if append_to_list:
                timestep.append(line)
print(timestep)
output:
['253400', '253500']

First - I don't like this, because it doesn't scale. You can only get the first immediately following line nicely, anything else will be just ugh...
But you asked, so ... for x in lines will create an iterator over lines and use that to keep the position. You don't have access to that iterator, so next will not be the next element you're expecting. But you can make your own iterator and use that:
lines_iter = iter(lines)
for line in lines_iter:
    if line.startswith("ITEM: TIMESTEP"):
        # next() advances the same iterator the for loop is using
        timestep.append(next(lines_iter).strip())
However, if you ever want to scale it... for is not a good way to iterate over a file like this. You want to know what is in the next/previous line. I would suggest using while:
timestep = []
with open('example.txt', 'r') as f:
    lines = f.readlines()
i = 0
while i < len(lines):
    if lines[i].startswith("ITEM: TIMESTEP"):
        i += 1
        # collect every line up to the next ITEM header (or the end of the file)
        while i < len(lines) and not lines[i].startswith("ITEM: "):
            timestep.append(lines[i].strip())
            i += 1
    else:
        i += 1
This way you can extend it for different types of ITEMS of variable length.
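As a rough sketch of what that extension could look like, the loop below collects both the TIMESTEP values and the number of ATOMS rows in each frame. The helper name collect_block and the in-memory SAMPLE_LINES list are made up for illustration; in practice the lines would come from f.readlines().

```python
# Hypothetical sample data standing in for f.readlines()
SAMPLE_LINES = [
    "ITEM: TIMESTEP", "253400",
    "ITEM: NUMBER OF ATOMS", "2",
    "ITEM: ATOMS id type q x y z",
    "16865 3 0 28.8028 1.81293 26.876",
    "16866 2 0 27.6753 2.22199 27.8362",
    "ITEM: TIMESTEP", "253500",
    "ITEM: NUMBER OF ATOMS", "1",
    "ITEM: ATOMS id type q x y z",
    "16865 3 0 28.8028 1.81293 26.876",
]


def collect_block(lines, start):
    """Collect lines from index `start` up to the next 'ITEM: ' header.

    Returns the collected lines and the index where scanning stopped.
    """
    i = start
    block = []
    while i < len(lines) and not lines[i].startswith("ITEM: "):
        block.append(lines[i].strip())
        i += 1
    return block, i


timestep = []
atoms_per_frame = []
i = 0
while i < len(SAMPLE_LINES):
    header = SAMPLE_LINES[i]
    i += 1
    if header.startswith("ITEM: TIMESTEP"):
        block, i = collect_block(SAMPLE_LINES, i)
        timestep.extend(int(value) for value in block)
    elif header.startswith("ITEM: ATOMS"):
        block, i = collect_block(SAMPLE_LINES, i)
        atoms_per_frame.append(len(block))

print(timestep)         # [253400, 253500]
print(atoms_per_frame)  # [2, 1]
```

Each new ITEM type only needs one more elif branch, because the helper already handles blocks of any length.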

So the problem with your code is subtle. You have a list lines which you iterate over, but you can't call next on a list.
Instead, turn it into an explicit iterator and you should be fine
timestep = []
with open(file, 'r') as f:
    lines = f.readlines()
    lines_iter = iter(lines)
    for line in lines_iter:
        line = line.strip()  # removes the newline
        if line.startswith("ITEM: TIMESTEP"):
            timestep.append(next(lines_iter, None))  # the second argument here prevents errors
                                                     # when ITEM: TIMESTEP appears as the
                                                     # last line in the file
print(timestep)
I'm also not sure why you included line.split, which seems to be incorrect (in any case line.split()[0].startswith('ITEM: TIMESTEP') can never be true, since the split will separate ITEM: and TIMESTEP into separate elements of the resulting list.)
For a more robust answer, consider grouping your data based on when the line begins with ITEM.
def process_file(f):
    ITEM_MARKER = 'ITEM: '
    item_title = '(none)'
    values = []
    for line in f:
        if line.startswith(ITEM_MARKER):
            if values:
                yield (item_title, values)
            item_title = line[len(ITEM_MARKER):].strip()  # strip off the marker
            values = []
        else:
            values.append(line.strip())
    if values:
        yield (item_title, values)
This will let you pass in the whole file and will lazily produce a set of values for each ITEM: <whatever> group. Then you can aggregate in some reasonable way.
with open(file, 'r') as f:
    groups = process_file(f)
    aggregations = {}
    # consume the generator while the file is still open
    for name, values in groups:
        aggregations.setdefault(name, []).extend(values)
print(aggregations['TIMESTEP'])  # this is what you want

You can use enumerate to help with index referencing. We can check to see if the string ITEM: TIMESTEP is in the previous line then add the integer to our timestep list.
timestep = []
with open('example.txt', 'r') as f:
    lines = f.readlines()
    for i, line in enumerate(lines):
        # skip i == 0, where lines[i-1] would wrap around to the last line
        if i > 0 and "ITEM: TIMESTEP" in lines[i-1]:
            timestep.append(int(line.strip()))
print(timestep)
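A variant of the same "look at the previous line" idea, sketched here without index arithmetic: pairing each line with its successor through zip. The function name read_timesteps and the in-memory SAMPLE are only for illustration; with a real file you would pass the open file object instead.

```python
import io

# Hypothetical sample standing in for the dump file
SAMPLE = """ITEM: TIMESTEP
253400
ITEM: NUMBER OF ATOMS
378
ITEM: TIMESTEP
253500
ITEM: NUMBER OF ATOMS
378
"""


def read_timesteps(f):
    """Return the integer found on the line after each 'ITEM: TIMESTEP' header."""
    lines = [line.strip() for line in f]
    # zip(lines, lines[1:]) yields (current, next) pairs and stops before the
    # last line, so a trailing header with nothing after it is simply ignored.
    return [int(nxt) for cur, nxt in zip(lines, lines[1:]) if cur == "ITEM: TIMESTEP"]


timesteps = read_timesteps(io.StringIO(SAMPLE))
print(timesteps)  # [253400, 253500]
```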

Related

Extract the index of largest number in different lines

I am writing a code for extracting specific lines from my file and then look for the maximum number, more specifically for its position (index).
So I start my code looking for the lines:
with open (filename,'r') as f:
    lines = f.readlines()
    for index, line in enumerate(lines):
        if 'a ' in line:
            x=(lines[index])
            print(x)
So here from my code I got the lines I was looking for:
a 3 4 5
a 6 3 2
Then the rest of my code is looking for the maximum between the numbers and prints the index:
y = [float(item) for item in x.split()]
z=y.index(max(y[1:3]))
print(z)
now the code finds the index of the two largest numbers (so for 5 in the first line and 6 in the second):
3
1
But I also want my code to compare the numbers across the two lines (so the largest number among 3, 4, 5, 6, 3, 2) and to output the index of the line in the file that contains the largest number (for example line 300) together with its position in that line (1).
Can you suggest to me some possible solutions?
You can try something like this.
max_value is a list that holds the max number, its line, and its position.
max_value = [0, 0, 0]  # value, line, position
with open(filename, 'r') as f:
    lines = f.readlines()
    for index, line in enumerate(lines):
        if 'a ' in line:
            # get the numeric tokens; split() without arguments also drops the trailing "\n"
            line_data = line.split()[1:]
            # if the element is a digit string and bigger than the max value, save it
            for el_index, element in enumerate(line_data):
                if element.isdigit() and int(element) > max_value[0]:
                    max_value = [int(element), index, el_index]
print(max_value)
Input data
a 3 4 5
a 6 3 2
Output data
# 6 - is max, 1 - line, 0 - position
[6, 1, 0]
You should iterate over every line and keep track of both the line number and the position of each item within that line. Note that the f-string in the last line requires Python 3.6+.
with open(filename) as f:
    lines = [line.rstrip() for line in f]
max_ = 0
line_and_position = (0, 0)
for i, line in enumerate(lines):
    if line.startswith('a '):
        # build the list of integers on this line
        numbers = [int(token) for token in line.split()[1:]]
        for item in numbers:
            if item > max_:
                max_ = item
                # store the line index and the character offset in that line;
                # find() returns the first substring match, so "3" would also match inside "31"
                line_and_position = i, line.find(str(item))
print(f'maximum number {max_} is in line {line_and_position[0] + 1} at index {line_and_position[1]}')
Input :
a 3 4 5
a 6 3 2
a 1 31 4
b 2 3 2
a 7 1 8
Output:
maximum number 31 is in line 3 at index 4
You can do it as shown below; each line is commented. This method differs from the others in that a regex gives us each number and its character position from a single source, so there is no going back into the line to find data after the fact. Everything we need arrives on each iteration of the loop, and the lines are filtered as they are read, which eliminates a stack of conditions. We end up with two loops that get directly to the point and one condition that decides whether the stored data needs updating.
import re
number = re.compile(r'(?P<n>\d+)')
with open(filename, 'r') as f:
    # prime data: (value, line index, character position)
    data = (0, 0, 0)
    # keep every line that starts with 'a'; replace the others with an empty string
    for L, ln in enumerate([ln if ln.startswith('a') else '' for ln in f.readlines()]):
        # get each number together with its line index and character position
        for res in [(int(m.group('n')), L, m.span()[0]) for m in number.finditer(ln)]:
            # compare new number with current max
            if res[0] > data[0]:
                # store new properties if greater
                data = res
# print final
print('Max: {}, Line: {}, Position: {}'.format(*data))
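For comparison, here is a compact sketch of the same search using the built-in max with a key function. It assumes every token after the prefix is an integer; find_max and the in-memory SAMPLE are illustrative names, and the position is the token index within the line rather than a character offset.

```python
import io

# Hypothetical sample; line indexes are 0-based
SAMPLE = "a 3 4 5\na 6 3 2\nb 9 9 9\na 1 31 4\n"


def find_max(f, prefix="a "):
    """Return (value, line_index, token_position) for the largest integer
    on lines starting with `prefix`, or None if no such line exists."""
    candidates = (
        (int(token), line_no, pos)
        for line_no, line in enumerate(f)
        if line.startswith(prefix)
        for pos, token in enumerate(line.split()[1:])
    )
    # The key compares values only, so ties keep the first occurrence.
    return max(candidates, key=lambda c: c[0], default=None)


result = find_max(io.StringIO(SAMPLE))
print(result)  # (31, 3, 1)
```

Because the candidates are produced by a generator, the file is read once and nothing but the current best tuple is kept in memory.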

Iterate through indexes of each string of a file of lines of the same length, while making new strings of each value from individual indexes

I have a data file containing a list of sequences, each 8 amino acids long.
As seen below:
QDFRGETW
AQAVRSSS
ANGVELRD
I would like to basically convert this file to:
QAN
DQN
FAG
RVV
GRE
....
WSD
with a simple for loop and while loop.
Here is what I have tried that works.
i2 = ''
with open('datafile','r') as f:
    for line in f:
        i2 += line[2]
What I would like to do is iterate through the indexes and add each of the new strings to a dictionary. So I decided to try this.
Dict = {}
i = 0
seq = ''
with open ('datafile','r') as f:
    while i <= 7:
        for line in f:
            seq += line[i]
            Dict[i] = seq
        i += 1
However, when I print the dictionary, it only shows, for example, {0:QAN} and nothing else. If I decrease the indent on Dict[i], it has all the keys, but every value is QAN instead of 1:DQN etc.
Weirdly, even when I run this code:
seq = ''
i = 0
with open ('datafile','r') as f:
    while i <= 7:
        for line in f:
            seq += line[i]
        i += 1
print(seq)
It also returns QAN, and not WSD as I expected. Therefore, there is an issue with the while loop. Any thoughts?
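The code below should work; input_file.txt is the file containing the text. I think the first line of the expected output should be QAA.
with open('input_file.txt') as f:
    # strip the newlines first, otherwise they form an extra column of "\n" characters
    for column in zip(*(line.strip() for line in f)):
        print(''.join(column))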
Output:
QAA
DQN
FAG
RVV
GRE
ESL
TSR
WSD
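As for why the while loop in the question only ever produces the first column: a file object is an iterator, so the first "for line in f" consumes every line, and each later pass of the while loop finds nothing left to read. A minimal sketch that reads the lines once and builds the dictionary the question asks for (columns_by_index and SAMPLE are illustrative names):

```python
import io

# Hypothetical sample standing in for 'datafile'
SAMPLE = "QDFRGETW\nAQAVRSSS\nANGVELRD\n"


def columns_by_index(f):
    """Map each character index to the string formed by that column."""
    lines = [line.strip() for line in f]  # read the file once and keep the lines
    return {i: "".join(column) for i, column in enumerate(zip(*lines))}


columns = columns_by_index(io.StringIO(SAMPLE))
print(columns[0])  # QAA
print(columns[7])  # WSD
```

Keeping the original loop structure would also work if the lines were first stored in a list, or if f.seek(0) were called before each pass, as long as seq is reset for every index.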

Find double lines; a faster way

This is what I do to find all double lines in a textfile
import regex  # regex is used like re
# capture all lines in buffer
r = f.readlines()
# create list of all line numbers
lines = list(range(1, endline+1))
# merge both lists
z = [list(a) for a in zip(r, lines)]
# sort list
newsorting = sorted(z)
# put doubles in list
listdoubles = []
for i in range(0, len(newsorting)-1):
    if (i+1) <= len(newsorting):
        if (newsorting[i][0] == newsorting[i+1][0]) and (not regex.search(r'^\s*$', newsorting[i][0])):
            listdoubles.append(newsorting[i][1])
            listdoubles.append(newsorting[i+1][1])
# remove any duplicate line numbers
listdoubles = list(set(listdoubles))
# sort line numbers numerically
listdoubles = sorted(listdoubles, key=int)
print(listdoubles)
But it is very slow. When I have over 10.000 lines it takes 10 seconds to create this list.
Is there a way to do it faster?
You can use a simpler approach:
for each line
    if it has been seen before then display it
    else add it to the set of known lines
In code:
seen = set()
for L in f:
    if L in seen:
        print(L)
    else:
        seen.add(L)
If you want to display the line numbers where duplicates appear, the code can simply be changed to use a dictionary that maps each line's content to the line number where that text was first seen:
seen = {}
for n, L in enumerate(f):
    if L in seen:
        print("Line %i is a duplicate of line %i" % (n, seen[L]))
    else:
        seen[L] = n
Both dict and set in Python are based on hashing and provide average-case constant-time lookups.
EDIT
If you need only the line number of the last duplicate of each line, then the output cannot be produced during processing; you first have to process the whole input before emitting anything.
# lastdup will be a map from line content to the line number where the
# last duplicate was found. On first insertion the value is None
# to mark that the line is not a duplicate
lastdup = {}
for n, L in enumerate(f):
    if L in lastdup:
        lastdup[L] = n
    else:
        lastdup[L] = None
# Now all values that are not None are the last duplicate of a line
result = sorted(x for x in lastdup.values() if x is not None)
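If the goal is the same list the question builds (every line number that belongs to a duplicated, non-blank line), one possible sketch is to group line numbers by content in a single pass. The name duplicate_line_numbers and the SAMPLE text are illustrative; line numbers are 1-based here, as in the question.

```python
import io
from collections import defaultdict

# Hypothetical sample: "alpha" appears on lines 1, 3, 8 and "beta" on lines 2, 6
SAMPLE = "alpha\nbeta\nalpha\n\ngamma\nbeta\n\nalpha\n"


def duplicate_line_numbers(f):
    """Return the sorted 1-based numbers of all non-blank lines whose text
    occurs more than once."""
    positions = defaultdict(list)
    for number, line in enumerate(f, start=1):
        text = line.rstrip("\n")  # so a final line without a newline still matches
        if text.strip():          # skip blank lines, like the regex check in the question
            positions[text].append(number)
    return sorted(n for numbers in positions.values() if len(numbers) > 1 for n in numbers)


doubles = duplicate_line_numbers(io.StringIO(SAMPLE))
print(doubles)  # [1, 2, 3, 6, 8]
```

This avoids sorting the lines themselves; the only sort is over the resulting line numbers.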

Define a variable from a file in Python

I'm trying to read a file and assign whatever number is on line one (and two and three, etc.) to a given variable. So say my file is just like:
1
5
6
1
So how would I take line one and assign that value to variable_a within my code, then take line two and assign it to variable_b? Thanks!
with open(fname) as f:
    content = f.readlines()
numbers = [int(x) for x in content]
Try this:
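import string
with open(fname, "r") as fi:
    # split() ignores the trailing newline, so no empty value is produced
    values = fi.read().split()
for pos, value in enumerate(values):
    # string.ascii_lowercase replaces the Python 2-only string.lowercase;
    # exec runs arbitrary code, so only use this on trusted files
    exec('variable_' + string.ascii_lowercase[pos] + '=' + value)
print(variable_b)  # will print 5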
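A possible alternative that avoids exec is to keep the values in a dictionary keyed by the would-be variable names. This is only a sketch: read_named_values and the in-memory SAMPLE are illustrative, and it assumes the file holds at most 26 whitespace-separated integers.

```python
import io
import string

# Hypothetical sample standing in for the file contents
SAMPLE = "1\n5\n6\n1\n"


def read_named_values(f):
    """Map 'variable_a', 'variable_b', ... to the integers found in f."""
    tokens = f.read().split()
    # zip stops at the shorter input, so extra values beyond 'z' are dropped
    return {
        f"variable_{letter}": int(token)
        for letter, token in zip(string.ascii_lowercase, tokens)
    }


values = read_named_values(io.StringIO(SAMPLE))
print(values["variable_b"])  # 5
```

When the number of lines is known in advance, plain unpacking such as variable_a, variable_b, variable_c, variable_d = numbers is another option.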

Reason for two similar codes giving different result and different approaches to this task

The question is
def sum_numbers_in_file(filename):
    """
    Return the sum of the numbers in the given file (which only contains
    integers separated by whitespace).
    >>> sum_numbers_in_file("numbers.txt")
    19138
    """
this is my first code:
rtotal = 0
myfile = open(filename,"r")
num = myfile.readline()
num_list = []
while num:
    number_line = ""
    number_line += (num[:-1])
    num_list.append(number_line.split(" "))
    num = myfile.readline()
for item in num_list:
    for item2 in item:
        if item2!='':
            rtotal+= int(item2)
return rtotal
this is my second code:
f = open(filename)
m = f.readline()
n = sum([sum([int(x) for x in line.split()]) for line in f])
f.close()
return n
however the first one returns 19138 and the second one 18138
numbers.txt contains the following:
1000
15000
2000
1138
Because m = f.readline() already reads one line from f, and then you do the operation with the rest of the lines. If you delete that statement, the two outputs will be the same. (I think :))
I'd say that m = f.readline() in the second snippet skips the first line (which contains 1000), that's why you get a wrong result.
As requested.. another approach to the question:
import re
def sum_numbers_in_file(filename):
    # a function named sum would shadow the built-in sum and call itself
    with open(filename) as f:
        return sum(int(x.group()) for x in re.finditer(r'\d+', f.read()))
As the other answers said, you are skipping the first line because of f.readline(). But a shorter approach would be:
n = sum(int(line) for line in open("numbers.txt") if line[0].isnumeric())
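Since the docstring says the file contains integers separated by whitespace, a sketch that handles several numbers per line (and a missing final newline) can split the whole file at once. The temporary file below only exists to make the example self-contained.

```python
import os
import tempfile


def sum_numbers_in_file(filename):
    """Return the sum of the whitespace-separated integers in the file."""
    with open(filename) as f:
        # split() with no arguments splits on any run of spaces or newlines
        return sum(int(token) for token in f.read().split())


# Demo with a temporary copy of the numbers.txt contents from the question
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "numbers.txt")
    with open(path, "w") as out:
        out.write("1000\n15000\n2000\n1138\n")
    total = sum_numbers_in_file(path)

print(total)  # 19138
```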
