I need to read a file in Python in which sections are enclosed by /*! marking the beginning of a section and * marking its end:
/*!Text
this text is to be printed, but it can expand
several lines
even empty lines, but they have to be printed in the same way they're encountered
this until a * character is found
*
/*!Another section starts here
whatever
*
The objective is to print the lines as they're encountered in each section for now (then I'll have to do some processing). To read a file in Python I have something like this:
# open file
with open(filename) as fh:
    fit = enumerate(iter(fh.readline, ''), start=1)
    # loop over lines
    for lino, line in fit:
        if line.startswith('/*!'):
            lino, line = next(fit)
            print(lino, line)
Now, instead of printing a single line, I would like to keep printing lines until a new line starts with the string '/*!'. In C one would use a peek function, so is there something equivalent in Python?
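For context: Python iterators have no built-in peek, but a common workaround is to pull the next item and chain it back onto the front. A minimal sketch (the peek helper and the file name are illustrative, not from the original code):
import itertools

def peek(iterator):
    """Return the next item plus an iterator that will still yield it."""
    first = next(iterator)  # raises StopIteration if the iterator is exhausted
    return first, itertools.chain([first], iterator)

with open('inputfile.txt') as fh:
    it = iter(fh)
    line, it = peek(it)      # look at the upcoming line
    assert line == next(it)  # the same line is still delivered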
UPDATE
So I may have made some progress by opening the file in binary mode (I'm using Python 3):
# open file
with open(filename, 'rb') as fh:
    fit = enumerate(iter(fh.readline, b''), start=1)
    # loop over lines
    for lino, line in fit:
        if not line:
            break
        if line.startswith(b'/*!'):
            while True:
                lino, line = next(fit)
                print(str(line))
                char = fh.read(1)
                # back one character
                fh.seek(-1, 1)
                if char == b'*':
                    break
But it seems to me there has to be a much more compact way to do this in Python. Any suggestions?
I'd use a regular expression:
import re

def get_sections(filename):
    with open(filename) as f:
        data = f.read()
    return re.findall(r'(?sm)^/\*!(.*?)^\*', data)
for section in get_sections('inputfile.txt'):
    print(section)
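In that pattern the inline flags do the work: (?s) lets . match newlines so a section body can span lines, and (?m) makes ^ anchor at the start of every line, so each match runs from a line beginning with /*! up to the next line beginning with *.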
Alternatively, I might create a generator function that yields only the section lines:
def get_section_line(f):
    iterator = enumerate(f)
    for lno, line in iterator:
        if line.startswith("/*!"):
            yield lno, line.replace("/*!", "", 1)
            for lno, line in iterator:
                if line.startswith('*'):
                    break
                yield lno, line

with open('inputfile.txt') as f:
    for lno, line in get_section_line(f):
        print("%04d %s" % (lno, line.rstrip('\n')))
Finally, here is a solution which maintains the section structure, in case knowing which section you're in matters:
import itertools

def get_sections(f):
    it = enumerate(f)
    for lno, line in it:
        if line.startswith("/*!"):
            yield itertools.chain(
                [(lno, line.replace("/*!", "", 1))],
                itertools.takewhile(lambda i: not i[1].startswith('*'), it))

with open('inputfile.txt') as f:
    for secno, section in enumerate(get_sections(f)):
        for lno, line in section:
            print("%04d %04d %s" % (secno, lno, line.rstrip('\n')))
You're bound to confuse things if you read & seek fh directly while you're in a loop reading it out of an iterator.
Anyway, this may give you some ideas...
filename = 'test.txt'

with open(filename, 'r') as fh:
    for line in fh:
        if line.startswith('/*!'):
            while True:
                line = next(fh)
                if line[0] == '*':
                    # print('* End of section *')
                    break
                print(line[:-1])  # line already ends in \n
You don't appear to be using the line numbers, so I got rid of the enumeration.
I have the following problem. I am supposed to open a CSV file (it's an Excel table) and read it without using any library.
I have already tried a lot, and I now have the first row as a tuple inside a list. But only the first line, the header, and no other rows.
This is what I have so far.
with open(path, 'r+') as file:
    results=[]
    text = file.readline()
    while text != '':
        for line in text.split('\n'):
            a=line.split(',')
            b=tuple(a)
            results.append(b)
        return results
The output should be: every line in a tuple, and all the tuples in a list.
My question is now: how can I read the other lines in Python?
I am really sorry, I am new to programming altogether, so I have a really hard time finding my mistake.
Thank you very much in advance for helping me out!
This problem has appeared many times on Stack Overflow, so you should be able to find working code.
But it is much better to use the csv module for this.
You have wrong indentation, and you use return results right after reading the first line, so it exits the function and never tries to read the other lines.
But even after changing this there are still other problems, so it still will not read the next lines.
You use readline(), so you read only the first line, and your loop works on the same line the whole time; it may never end because you never set text = ''.
You should use read() to get all the text, which you later split into lines using split('\n'); or you could use readlines() to get all lines as a list, and then you don't need split(); or you can use for line in file:. In all of these situations you don't need the while loop.
def read_csv(path):
    with open(path, 'r+') as file:
        results = []
        text = file.read()
        for line in text.split('\n'):
            items = line.split(',')
            results.append(tuple(items))
        # after for-loop
        return results
def read_csv(path):
    with open(path, 'r+') as file:
        results = []
        lines = file.readlines()
        for line in lines:
            line = line.rstrip('\n')  # remove `\n` at the end of line
            items = line.split(',')
            results.append(tuple(items))
        # after for-loop
        return results
def read_csv(path):
    with open(path, 'r+') as file:
        results = []
        for line in file:
            line = line.rstrip('\n')  # remove `\n` at the end of line
            items = line.split(',')
            results.append(tuple(items))
        # after for-loop
        return results
None of these versions will work correctly if you have '\n' or ',' inside an item, where it shouldn't be treated as the end of a row or as a separator between items. Such items are wrapped in " ", which also makes them harder to strip. All of these problems can be resolved with the standard csv module.
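For comparison, here is a minimal sketch of the same function built on the standard csv module, which handles quoted fields containing ',' or line breaks for you:
import csv

def read_csv(path):
    # newline='' is what the csv documentation recommends when opening files
    with open(path, newline='') as file:
        return [tuple(row) for row in csv.reader(file)]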
Your code is pretty good and you are near the goal:
with open(path, 'r+') as file:
    results=[]
    text = file.read()
    #while text != '':
    for line in text.split('\n'):
        a=line.split(',')
        b=tuple(a)
        results.append(b)
    return results
Your code:
with open(path, 'r+') as file:
    results=[]
    text = file.readline()
    while text != '':
        for line in text.split('\n'):
            a=line.split(',')
            b=tuple(a)
            results.append(b)
        return results
So enjoy learning :)
One caveat is that the CSV must not end with a blank line, as this would result in an ugly tuple like ('',) at the end of the list (which looks like a smiley).
To prevent this you have to check for empty lines: an if line != '': after the for will do the trick.
How could I print the final line of a text file read in with Python?
fi = open(inputFile, "r")
for line in fi:
    # go to last line and print it
One option is to use file.readlines():
f1 = open(inputFile, "r")
last_line = f1.readlines()[-1]
f1.close()
If you don't need the file afterwards, though, it is recommended to use a with context, so that the file is automatically closed:
with open(inputFile, "r") as f1:
    last_line = f1.readlines()[-1]
Do you need to be efficient, avoiding reading all the lines into memory at once? If so, you can iterate over the file object instead.
with open(inputfile, "r") as f:
    for line in f:
        pass
print(line)  # this is the last line of the file
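One caveat: if the file is empty, the loop body never runs and line is left undefined, so the final print raises a NameError; bind line = None before the loop if that can happen.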
Three ways to read the last line of a file:
For a small file, read the entire file into memory:
with open("file.txt") as file:
    lines = file.readlines()
    print(lines[-1])
For a big file, read line by line and print the last line:
with open("file.txt") as file:
    for line in file:
        pass
    print(line)
For an efficient approach, go directly to the last line:
import os

with open("file.txt", "rb") as file:
    # Jump to just before the file's trailing newline
    file.seek(-2, os.SEEK_END)
    # Keep stepping backward until the previous line break is found
    while file.read(1) != b'\n':
        file.seek(-2, os.SEEK_CUR)
    print(file.readline().decode())
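Note that the backward-seek version assumes there is at least one newline before the last line: on a file shorter than two bytes, or a one-line file with no preceding newline, the negative seek will raise OSError, so you may want to fall back to one of the simpler variants in that case.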
If you can afford to read the entire file into memory (i.e. the file size is considerably less than the total memory), you can use the readlines() method as mentioned in one of the other answers. But if the file is large, the best way to do it is:
fi = open(inputFile, 'r')
lastline = ""
for line in fi:
    lastline = line
print(lastline)
You could use csv.reader() to read your file as a list and print the last line.
Cons: This method allocates a new variable (not an ideal memory-saver for very large files).
Pros: List lookups take O(1) time, and you can easily manipulate a list if you happen to want to modify your inputFile, as well as read the final line.
import csv

lis = list(csv.reader(open(inputFile)))
print(lis[-1])  # prints the final line as a list of strings
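Note that the open(inputFile) handle here is never explicitly closed; it is tidier to wrap it in a with block, e.g. with open(inputFile, newline='') as f: lis = list(csv.reader(f)).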
If you care about memory, this should help you.
import os

last_line = b''
with open(inputfile, 'rb') as f:  # binary mode, so relative seeks work in Python 3
    f.seek(-2, os.SEEK_END)  # -2 because the last character is likely \n
    cur_char = f.read(1)
    while cur_char != b'\n':
        last_line = cur_char + last_line
        f.seek(-2, os.SEEK_CUR)
        cur_char = f.read(1)
print(last_line.decode())
This might help you.
class FileRead(object):

    def __init__(self, file_to_read=None, file_open_mode=None, stream_size=100):
        super(FileRead, self).__init__()
        self.file_to_read = file_to_read
        self.file_to_write = 'test.txt'
        self.file_mode = file_open_mode
        self.stream_size = stream_size

    def file_read(self):
        try:
            with open(self.file_to_read, self.file_mode) as file_context:
                contents = file_context.read(self.stream_size)
                while len(contents) > 0:
                    yield contents
                    contents = file_context.read(self.stream_size)
        except Exception as e:
            if type(e).__name__ == 'IOError':
                output = "You have a file input/output error {}".format(e.args[1])
                raise Exception(output)
            else:
                output = "You have a file error {} {}".format(file_context.name, e.args)
                raise Exception(output)
b = FileRead("read.txt", 'r')
contents = b.file_read()

lastline = ""
for content in contents:
    # print('-------')
    lastline = content
print(lastline)
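Be aware that lastline here ends up holding the final chunk of up to stream_size characters, not the final line; you would still need something like lastline.splitlines()[-1] to extract the line itself.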
I use the pandas module for its convenience (often to extract the last value).
Here is an example for the last row:
import pandas as pd
df = pd.read_csv('inputFile.csv')
last_value = df.iloc[-1]
The return value is a pandas Series containing the last row.
The advantage of this is that you also get the entire contents as a pandas DataFrame.
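Note that pd.read_csv treats the first row of the file as the header by default, so df.iloc[-1] is the last data row rather than the last physical line of the file; pass header=None if your file has no header row.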
I'm trying to delete a specific line (10884121) in a text file that is about 30 million lines long. This is the method I first attempted; however, when I execute it, it runs for about 20 seconds and then gives me a "memory error". Is there a better way to do this? Thanks!
import fileinput
import sys

f_in = 'C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\fullyCleaned2.txt'
f_out = 'C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\fullyCleaned3.txt'

with open(f_in, 'r') as fin:
    with open(f_out, 'w') as fout:
        linenums = [10884121]
        s = [y for x, y in enumerate(fin) if x not in [line - 1 for line in linenums]]
        fin.seek(0)
        fin.write(''.join(s))
        fin.truncate(fin.tell())
First of all, you were not using the imports; you were trying to write to the input file, and your code read the whole file into memory.
Something like this might do the trick with less hassle: we read line by line, using enumerate to count the line numbers, and for each line we write it to the output if its number is not in the list of ignored lines:
f_in = 'C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\fullyCleaned2.txt'
f_out = 'C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\fullyCleaned3.txt'

ignored_lines = [10884121]

with open(f_in, 'r') as fin, open(f_out, 'w') as fout:
    for lineno, line in enumerate(fin, 1):
        if lineno not in ignored_lines:
            fout.write(line)
Please try to use:
import fileinput

f_in = 'C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\fullyCleaned2.txt'
f_out = 'C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\fullyCleaned3.txt'

f = open(f_out, 'w')
counter = 0
for line in fileinput.input([f_in]):
    counter += 1
    if counter != 10884121:
        f.write(line)  # Python converts \n to os.linesep; check whether you need to add os.linesep yourself
f.close()  # you can omit this in most cases, as the destructor will call it
There is a high chance that you run out of memory since you are trying to store the whole file in a list.
Try this below:
f_in = 'C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\fullyCleaned2.txt'
f_out = 'C:\\Users\\Lucas\\Documents\\Python\\Pagelinks\\fullyCleaned3.txt'

_fileOne = open(f_in, 'r')
_fileTwo = open(f_out, 'w')

linenums = set([10884121])

# start counting at 1 so the numbers match the target line number
for lineNumber, line in enumerate(_fileOne, 1):
    if lineNumber not in linenums:
        _fileTwo.write(line)

_fileOne.close()
_fileTwo.close()
Here we are reading the file line by line and excluding the lines which are not needed, so this should not run out of memory.
You can also try reading the file with a larger buffer (the buffering argument of open()).
Hope this helps.
How about a generic file filter function?
def file_filter(file_path, condition=None):
    """Yield lines from a file if condition(n, line) is true.

    The condition parameter is a callback that receives two
    parameters: the line number (first line is 1) and the
    line content."""
    if condition is None:
        condition = lambda n, line: True
    with open(file_path) as source:
        for n, line in enumerate(source):
            if condition(n + 1, line):
                yield line
with open(f_out, 'w') as destination:
    condition = lambda n, line: n != 10884121
    for line in file_filter(f_in, condition):
        destination.write(line)
I need to read an input text file in Python, streaming it line by line; that means loading the text file line by line instead of all at once into memory. But my line delimiters are not newlines; they are arbitrary characters.
Here is a method on Stack Overflow for loading files line by line:
with open("log.txt") as infile:
for line in infile:
do_something_with(line)
The above is perfect; however, I need to change the delimiter from the newline to a different character.
How can this be done? Thank you.
import re

def open_delimited(filename, delimiter, chunksize=1024, *args, **kwargs):
    with open(filename, *args, **kwargs) as infile:
        remainder = ''
        for chunk in iter(lambda: infile.read(chunksize), ''):
            pieces = re.split(delimiter, remainder + chunk)
            for piece in pieces[:-1]:
                yield piece
            remainder = pieces[-1]
        if remainder:
            yield remainder
for line in open_delimited("log.txt", delimiter='/'):
    print(repr(line))
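One thing to keep in mind: the delimiter is passed straight to re.split, so it is interpreted as a regular expression; if it may contain metacharacters such as '|' or '.', wrap it with re.escape(delimiter) first.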
Python doesn't have a native construct for this. You can write a generator that reads the characters one at a time and accumulates them until you have a whole delimited item.
def items(infile, delim):
    item = []
    c = infile.read(1)
    while c:
        if c == delim:
            yield "".join(item)
            item = []
        else:
            item.append(c)
        c = infile.read(1)
    yield "".join(item)
with open("log.txt") as infile:
for item in items(infile, ","): # comma delimited
do_something_with(item)
You will get better performance if you read the file in chunks (say, 64K or so) and split these. However, the logic for this is more complicated since an item may be split across chunks, so I won't go into it here as I'm not 100% sure I'd get it right. :-)
I'm writing a short program in Python that will read a FASTA file which is usually in this format:
>gi|253795547|ref|NC_012960.1| Candidatus Hodgkinia cicadicola Dsem chromosome, 52 lines
GACGGCTTGTTTGCGTGCGACGAGTTTAGGATTGCTCTTTTGCTAAGCTTGGGGGTTGCGCCCAAAGTGA
TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC
TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA
I've created another program that reads the first line (a.k.a. the header) of this FASTA file, and now I want this second program to start reading and printing beginning from the sequence.
How would I do that?
So far I have this:
FASTA = open("test.txt", "r")

def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    for line in FASTA:
        line = line.strip()
        print(line)

readSeq(FASTA)
Thanks guys
-Noob
def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    _unused = next(FASTA)  # skip heading record
    for line in FASTA:
        line = line.strip()
        print(line)
Read the docs on iterators and next() to see why you should be wary of mixing file.readline() with for line in file:
You should show your script. To read from the second line, do something like this:
f = open("file")
f.readline()
for line in f:
    print(line)
f.close()
You might be interested in checking how Biopython handles FASTA files (source).
def FastaIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
    """Generator function to iterate over Fasta records (as SeqRecord objects).

    handle - input file
    alphabet - optional alphabet
    title2ids - A function that, when given the title of the FASTA
    file (without the beginning >), will return the id, name and
    description (in that order) for the record as a tuple of strings.
    If this is not given, then the entire title line will be used
    as the description, and the first word as the id and name.

    Note that use of title2ids matches that of Bio.Fasta.SequenceParser
    but the defaults are slightly different.
    """
    # Skip any text before the first record (e.g. blank lines, comments)
    while True:
        line = handle.readline()
        if line == "":
            return  # Premature end of file, or just empty?
        if line[0] == ">":
            break

    while True:
        if line[0] != ">":
            raise ValueError("Records in Fasta files should start with '>' character")
        if title2ids:
            id, name, descr = title2ids(line[1:].rstrip())
        else:
            descr = line[1:].rstrip()
            id = descr.split()[0]
            name = id

        lines = []
        line = handle.readline()
        while True:
            if not line:
                break
            if line[0] == ">":
                break
            # Remove trailing whitespace, and any internal spaces
            # (and any embedded \r which are possible in mangled files
            # when not opened in universal read lines mode)
            lines.append(line.rstrip().replace(" ", "").replace("\r", ""))
            line = handle.readline()

        # Return the record and then continue...
        yield SeqRecord(Seq("".join(lines), alphabet),
                        id=id, name=name, description=descr)

        if not line:
            return  # StopIteration
    assert False, "Should not reach this line"
Good to see another bioinformatician :)
Just include an if clause within your for loop, above the line.strip() call:
def readSeq(FASTA):
    for line in FASTA:
        if line.startswith('>'):
            continue
        line = line.strip()
        print(line)
A Pythonic and simple way to do this would be slice notation.
>>> f = open('filename')
>>> lines = f.readlines()
>>> lines[1:]
['TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC\n', 'TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTG
TTGGGCTCGCCGAAAGCTGGCAGGTCGA']
That says "give me all elements of lines, from the second (index 1) to the end.
Other general uses of slice notation:
s[i:j] slice of s from i to j
s[i:j:k] slice of s from i to j with step k (k can be negative to go backward)
Either i or j can be omitted (to imply the beginning or the end), and j can be negative to indicate a number of elements from the end.
s[:-1] All but the last element.
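For example, with the lines list read above (an illustration of the patterns, not from the original answer):
lines[1:]     # everything after the header line
lines[:-1]    # everything except the last line
lines[::-1]   # the whole list reversed (step k = -1)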
Edit in response to gnibbler's comment:
If the file is truly massive you can use iterator slicing to get the same effect while making sure you don't get the whole thing in memory.
import itertools

f = open("filename")
# start at the second line, don't stop, stride by one
for line in itertools.islice(f, 1, None, 1):
    print(line)
"islicing" doesn't have the nice syntax or extra features of regular slicing, but it's a nice approach to remember.