Extract the dimensions from the header line of a text file - python

Please see the attached image showing the format of the text file. I need to extract the dimensions of the data matrix given in the first line of the file, here 49 * 70 * 1 for the case shown in the image. Note that the length of the name "gd_fac" can vary. How can I extract these numbers as integers? I am using Python 3.6.

The specification is not very clear. I am assuming that the information you want will always be in the first line, and always in parentheses. Given that:
with open(filename) as infile:
    line = infile.readline()  # first line only
string = line[line.find('(')+1:line.find(')')]  # text between ( and )
lst = [int(s) for s in string.split('x')]  # split on x, convert to int
This will create the list lst = [49, 70, 1].
What is happening here:
First I open the file (you will need to replace filename with the name of your file, as a string). The with ... as ... structure ensures that the file is closed after use. Then I read the first line. After that, I select only the part of that line that falls after the open paren ( and before the close paren ). Finally, I break the string into parts, with the character x as the separator, and convert each part with int(), since split() returns strings. This creates a list of the integer values in the first line of the file that fall between the parentheses and are separated by x.

Since you have mentioned that the length of 'gd_fac' can vary, the best solution is to use a regular expression.
import re
with open("a.txt") as fh:
    for line in fh:
        if '(' in line and ')' in line:
            # capture whatever sits between the parentheses
            dimension = re.findall(r'.*\((.*)\)', line)[0]
            break
print(dimension)
Output:
49x70x1
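The regular-expression answer stops at the string '49x70x1', while the question asks for integers. A minimal sketch of the remaining step follows. The function name parse_dimensions and the sample header line are assumptions made for illustration, since the real header is only visible in the question's image.

```python
import re

def parse_dimensions(header_line):
    """Return the integers found inside the first pair of parentheses."""
    match = re.search(r"\(([^)]*)\)", header_line)
    if match is None:
        raise ValueError("no parenthesized dimensions found")
    # \d+ picks out each run of digits, so the separator ('x', '*', spaces) does not matter.
    return [int(n) for n in re.findall(r"\d+", match.group(1))]

# Hypothetical header line; the real one is only shown in the question's image.
print(parse_dimensions("gd_fac (49x70x1)"))  # [49, 70, 1]
```

Because only the parentheses are matched, the length of the name in front of them does not matter.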

This looks for "gd_fac"; if it is there, it removes everything that is not needed and leaves just what you want.
with open('test.txt', 'r') as infile:
    for line in infile:
        if "gd_fac" in line:
            line = line.replace("gd_fac", "")
            line = line.replace("x", "*")
            line = line.replace("(", "")
            line = line.replace(")", "")
            print(line)
            break
OUTPUT: "49*70*1"

Related

I'm trying to solve this Python exercise but I have no idea of how to do it: get first character of a line from a file + length of the line

I am learning Python on an app called SoloLearn, got to solve this exercise and I cannot see the solution or see the comments, I don't need to solve it to continue but I'd like to know how to do it.
Book Titles: You have been asked to make a special book categorization program, which assigns each book a special code based on its title.
The code is equal to the first letter of the book, followed by the number of characters in the title.
For example, for the book "Harry Potter", the code would be: H12, as it contains 12 characters (including the space).
You are provided a books.txt file, which includes the book titles, each one written on a separate line.
Read the title one by one and output the code for each book on a separate line.
For example, if the books.txt file contains:
Some book
Another book
Your program should output:
S9
A12
Recall the readlines() method, which returns a list containing the lines of the file.
Also, remember that all lines, except the last one, contain a \n at the end, which should not be included in the character count.
I tried:
file = open("books.txt","r")
for line in file:
for i in range(len(file.readlines())):
title = line[0]+str(len(line)-1)
print(titulo)
title = line[0]+str(len(line)-1)
print(title)
file.close
I also tried with range() and readlines() but I don't know how to solve it
This uses readlines():
with open('books.txt') as f:  # Open file
    for line in f.readlines():  # Iterate through lines
        if line[-1] == '\n':  # Check if there is '\n' at end of line
            line = line[:-1]  # If there is, ignore it
        print(line[0], len(line), sep='')  # Output first character and length
But I think splitlines() is easier, as it doesn't have the trailing '\n':
with open('books.txt') as f:  # Open file
    for line in f.read().splitlines():  # Iterate through lines
        # No need to check for trailing '\n'
        print(line[0], len(line), sep='')  # Output first character and length
You can use "with" to handle file opening and closing.
Use rstrip to get rid of '\n'.
with open('books.txt') as f:
    lines = f.readlines()
for line in lines:
    print(line[0] + str(len(line.rstrip())))
This is the same:
file = open('books.txt')
lines = file.readlines()
for line in lines:
    print(line[0] + str(len(line.rstrip())))
file.close()
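The answers above can be folded into one small generator. This is only a sketch: the name book_codes is made up for illustration, and io.StringIO stands in for open("books.txt") so that it runs without the exercise's file.

```python
import io

def book_codes(lines):
    """Yield the first character plus the title length for each non-empty line."""
    for line in lines:
        title = line.rstrip("\n")  # drop only the line terminator, keep spaces
        if title:                  # skip blank lines so title[0] cannot fail
            yield title[0] + str(len(title))

# io.StringIO stands in for open("books.txt") so the sketch runs without a file.
sample = io.StringIO("Some book\nAnother book\n")
for code in book_codes(sample):
    print(code)  # S9, then A12
```

Using rstrip("\n") instead of a bare rstrip() keeps trailing spaces in the count, which matters only if a title really ends with a space.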

How do I print only the first instance of a string in a text file using Python?

I am trying to extract data from a .txt file in Python. My goal is to capture the last occurrence of a certain word and show the next line, so I do a reverse () of the text and read from behind. In this case, I search for the word 'MEC', and show the next line, but I capture all occurrences of the word, not the first.
Any idea what I need to do?
Thanks!
This is what my code looks like:
import re
from file_read_backwards import FileReadBackwards
with FileReadBackwards("camdex.txt", encoding="utf-8") as file:
for l in file:
lines = l
while line:
if re.match('MEC', line):
x = (file.readline())
x2 = (x.strip('\n'))
print(x2)
break
line = file.readline()
The txt file contains this:
MEC
29/35
MEC
28,29/35
My code prints this output:
28,29/35
29/35
My objective is to print only this:
28,29/35
This will give you the result as well. Loop through the lines and add the line that follows each match to a list. Then print the last element.
import re
with open("data/camdex.txt", encoding="utf-8") as file:
    result = []
    for line in file:
        if re.match('MEC', line):
            x = file.readline()  # the line right after the match
            result.append(x.strip('\n'))
print(result[-1])
Get rid of the extra imports and overhead. Read your file normally, remembering the last line that qualifies.
with open("camdex.txt", encoding="utf-8") as file:
    for line in file:
        if line.startswith("MEC"):
            last = line
print(last[4:-1]) # "4" gets rid of "MEC "; "-1" stops just before the line feed.
If the file is very large, then reading backwards makes sense -- seeking to the end and backing up will be faster than reading to the end.
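In the sample file from the question the value sits on the line after "MEC", not on the same line. Under that assumption, a forward pass that remembers the line following the last match could look like the sketch below. The function name value_after_last is hypothetical, and io.StringIO stands in for the real camdex.txt.

```python
import io

def value_after_last(lines, marker="MEC"):
    """Return the line that follows the last line starting with `marker`, or None."""
    last = None
    lines = iter(lines)
    for line in lines:
        if line.startswith(marker):
            # Pull the following line from the same iterator; "" if the marker is the final line.
            last = next(lines, "").rstrip("\n")
    return last

# Same content as the sample file in the question.
sample = io.StringIO("MEC\n29/35\nMEC\n28,29/35\n")
print(value_after_last(sample))  # 28,29/35
```

Only one line is kept in memory at a time, so no list of matches is needed.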

How to extract last line of text in Python (excluding new lines)?

Textfile:
1
2
3
4
5
6
\n
\n
I know lines[-1] gets you the last line, but I want to disregard any new lines and get the last line of text (6 in this case).
The best approach regarding memory is to exhaust the file. Something like this:
with open('file.txt') as f:
    last = None
    for line in (line for line in f if line.rstrip('\n')):
        last = line
print(last)
It can be done more elegantly though. A slightly different approach:
with open('file.txt') as f:
    last = None
    for last in (line for line in f if line.rstrip('\n')):
        pass
print(last)
For a small file you can just read all of the lines, discarding any empty ones. Notice that I've used an inner generator to strip the lines before excluding them in the outer one.
with open(textfile) as fp:
    last_line = [l2 for l2 in (l1.strip() for l1 in fp) if l2][-1]
with open('file') as f:
    print([i for i in f.read().split('\n') if i != ''][-1])
This is just an edit to Avinash Raj's answer (but since I'm a new account, I can't comment on it). This will preserve any None values in your data (i.e. if the data in your last line is "None" it will work, though depending on your input this may not be an issue).
with open('path/to/file') as infile:
    for line in infile:
        if not line.strip('\n'):
            continue
        answer = line
print(answer)
This will print 6 with a newline at the end. You can decide how to strip that. Following are some options:
answer.rstrip('\n') removes trailing newlines
answer.rstrip() removes trailing whitespaces
answer.strip() removes any surrounding whitespaces
with open('file.txt') as myfile:
    for num, line in enumerate(myfile):
        pass
print(num)  # zero-based index of the last line in the file (blank lines included)
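The loop-based answers above can be wrapped in a helper that also decides how to strip the result. This is a sketch under the assumption that a line containing only whitespace counts as blank; the name last_nonblank_line is made up for illustration.

```python
import io

def last_nonblank_line(lines):
    """Return the last line that contains non-whitespace text, or None."""
    last = None
    for line in lines:
        if line.strip():           # true only when something other than whitespace remains
            last = line.rstrip("\n")
    return last

# Same content as the text file in the question: six numbers, then two empty lines.
sample = io.StringIO("1\n2\n3\n4\n5\n6\n\n\n")
print(last_nonblank_line(sample))  # 6
```

Returning None for a file without any text avoids the NameError or IndexError that some of the shorter variants would raise on an empty file.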

\n appending at the end of each line

I am writing lines one by one to an external files. Each line has 9 columns separated by Tab delimiter. If i split each line in that file and output last column, i can see \n being appended to the end of the 9 column. My code is:
#!/usr/bin/python
with open("temp", "r") as f:
    for lines in f:
        hashes = lines.split("\t")
        print hashes[8]
The last column values are integers, either 1 or 2. When i run this program, the output i get is,
['1\n']
['2\n']
I should only get 1 or 2. Why is '\n' being appended here?
I tried the following check to remove the problem.
with open("temp", "r") as f:
    for lines in f:
        if lines != '\n':
            hashes = lines.split("\t")
            print hashes[8]
This too is not working. I tried if lines != ' '. How can i make this go away? Thanks in advance.
Try using strip on the lines to remove the \n (the newline character). strip removes the leading and trailing whitespace characters.
with open("temp", "r") as f:
    for lines in f:
        lines = lines.strip()  # drops the trailing "\n" along with any other surrounding whitespace
        if lines:  # skip blank lines
            hashes = lines.split("\t")
            print(hashes[8])
\n is the newline character; it is how the computer knows to display the data on the next line. If you modify the last item in the list, hashes[-1], to remove the last character, then that should be fine.
Depending on the platform, your line ending may be more than just one character. DOS/Windows uses "\r\n", for example.
def clean(file_handle):
    for line in file_handle:
        yield line.rstrip()  # strips "\n", "\r\n" and any other trailing whitespace
with open('temp', 'r') as f:
    for line in clean(f):
        hashes = line.split('\t')
        print(hashes[-1])
I prefer rstrip() for times when I want to preserve leading whitespace. That and using generator functions to clean up my input.
Because each line has 9 columns, the item at index 8 (the 9th item) ends with the line break that precedes the next line. Just slice it off:
print(hashes[8][:-1])
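One detail the answers above handle differently is what exactly gets stripped. A bare strip() or rstrip() also removes trailing tabs, which would drop empty trailing columns, and [:-1] removes one character whether or not it is a newline. A sketch that removes only the line terminator follows; the name last_columns is hypothetical, and io.StringIO stands in for the "temp" file.

```python
import io

def last_columns(lines, sep="\t"):
    """Yield the last column of each non-blank line, without the line ending."""
    for line in lines:
        line = line.rstrip("\r\n")   # removes "\n" or "\r\n", but keeps tabs and spaces
        if line:
            yield line.split(sep)[-1]

# Made-up rows: one Unix line ending, one blank line, one Windows line ending.
sample = io.StringIO("a\tb\tc\t1\n\nd\te\tf\t2\r\n")
print(list(last_columns(sample)))  # ['1', '2']
```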

How can I use readline() to begin from the second line?

I'm writing a short program in Python that will read a FASTA file which is usually in this format:
>gi|253795547|ref|NC_012960.1| Candidatus Hodgkinia cicadicola Dsem chromosome, 52 lines
GACGGCTTGTTTGCGTGCGACGAGTTTAGGATTGCTCTTTTGCTAAGCTTGGGGGTTGCGCCCAAAGTGA
TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC
TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA
I've created another program that reads the first line(aka header) of this FASTA file and now I want this second program to start reading and printing beginning from the sequence.
How would I do that?
so far i have this:
FASTA = open("test.txt", "r")
def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    for line in FASTA:
        line = line.strip()
        print line
readSeq(FASTA)
Thanks guys
-Noob
def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    _unused = next(FASTA)  # skip heading record
    for line in FASTA:
        line = line.strip()
        print(line)
If you are on Python 2, read the docs on file.next() to see why you should be wary of mixing file.readline() with for line in file: loops.
You should show your script. To read from the second line, try something like this:
f = open("file")
f.readline()  # read and discard the first line
for line in f:
    print(line)
f.close()
You might be interested in checking Biopython's handling of FASTA files (source).
def FastaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
    """Generator function to iterate over Fasta records (as SeqRecord objects).
    handle - input file
    alphabet - optional alphabet
    title2ids - A function that, when given the title of the FASTA
    file (without the beginning >), will return the id, name and
    description (in that order) for the record as a tuple of strings.
    If this is not given, then the entire title line will be used
    as the description, and the first word as the id and name.
    Note that use of title2ids matches that of Bio.Fasta.SequenceParser
    but the defaults are slightly different.
    """
    #Skip any text before the first record (e.g. blank lines, comments)
    while True:
        line = handle.readline()
        if line == "" : return #Premature end of file, or just empty?
        if line[0] == ">":
            break
    while True:
        if line[0]!=">":
            raise ValueError("Records in Fasta files should start with '>' character")
        if title2ids:
            id, name, descr = title2ids(line[1:].rstrip())
        else:
            descr = line[1:].rstrip()
            id = descr.split()[0]
            name = id
        lines = []
        line = handle.readline()
        while True:
            if not line : break
            if line[0] == ">": break
            #Remove trailing whitespace, and any internal spaces
            #(and any embedded \r which are possible in mangled files
            #when not opened in universal read lines mode)
            lines.append(line.rstrip().replace(" ","").replace("\r",""))
            line = handle.readline()
        #Return the record and then continue...
        yield SeqRecord(Seq("".join(lines), alphabet),
                        id = id, name = name, description = descr)
        if not line : return #StopIteration
    assert False, "Should not reach this line"
good to see another bioinformatician :)
just include an if clause within your for loop above the line.strip() call
def readSeq(FASTA):
    for line in FASTA:
        if line.startswith('>'):
            continue
        line = line.strip()
        print(line)
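The readSeq versions above print each line, although the docstring in the question says the function returns the sequence. A sketch that skips header lines and returns the joined sequence could look like this. The name read_sequence and the two-line sample record are assumptions for illustration, and it treats everything in the file as a single record.

```python
import io

def read_sequence(handle):
    """Return all non-header lines of a FASTA handle joined into one string."""
    parts = []
    for line in handle:
        if line.startswith(">"):   # header line, not part of the sequence
            continue
        parts.append(line.strip())
    return "".join(parts)

# Made-up record standing in for the question's test.txt.
sample = io.StringIO(">example header\nGACG\nTTAG\n")
print(read_sequence(sample))  # GACGTTAG
```

For files with several records, a parser such as the Biopython iterator quoted above is the safer choice, since this sketch would concatenate them.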
A pythonic and simple way to do this would be slice notation.
>>> f = open('filename')
>>> lines = f.readlines()
>>> lines[1:]
['TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC\n', 'TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA']
That says "give me all elements of lines, from the second (index 1) to the end."
Other general uses of slice notation:
s[i:j] slice of s from i to j
s[i:j:k] slice of s from i to j with step k (k can be negative to go backward)
Either i or j can be omitted (to imply the beginning or the end), and j can be negative to indicate a number of elements from the end.
s[:-1] All but the last element.
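The slice forms listed above, applied to a small made-up list (the element names are placeholders, not data from the question):

```python
lines = ["header", "row1", "row2", "row3"]

print(lines[1:])     # ['row1', 'row2', 'row3'], everything after the first element
print(lines[:-1])    # ['header', 'row1', 'row2'], all but the last element
print(lines[0:4:2])  # ['header', 'row2'], every second element from index 0 up to 4
print(lines[::-1])   # ['row3', 'row2', 'row1', 'header'], negative step walks backward
```

Each slice builds a new list and leaves the original untouched.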
Edit in response to gnibbler's comment:
If the file is truly massive you can use iterator slicing to get the same effect while making sure you don't get the whole thing in memory.
import itertools
f = open("filename")
#start at the second line, don't stop, stride by one
for line in itertools.islice(f, 1, None, 1):
    print(line)
"islicing" doesn't have the nice syntax or extra features of regular slicing, but it's a nice approach to remember.
