How to eliminate last digit from each of the top lines - python

>Sequence 1.1.1 ATGCGCGCGATAAGGCGCTA
ATATTATAGCGCGCGCGCGGATATATATATATATATATATT
>Sequence 1.2.2 ATATGCGCGCGCGCGCGGCG
ACCCCGCGCGCGCGCGGCGCGATATATATATATATATATATT
>Sequence 2.1.1 ATTCGCGCGAGTATAGCGGCG
Now, I would like to remove the last digit from each line that starts with '>'. For example, in the first header I would like to remove the rightmost '.1', and in the second the rightmost '.2', and then write the rest of the file to a new file. Thanks,

import fileinput
import re

for line in fileinput.input(inplace=True, backup='.bak'):
    line = line.rstrip()
    if line.startswith('>'):
        line = re.sub(r'\.\d$', '', line)
    print(line)
Many details can be changed depending on the processing you want, which you haven't fully specified, but this is the general idea.

import re
trimmedtext = re.sub(r'(\d+\.\d+)\.\d', r'\1', text)
Should do it. Somewhat simpler than searching for start characters (and it won't affect your DNA chains, which contain no digits).
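A quick sanity check of the backreference form (Python's re.sub uses \1, not $1) on a made-up header resembling the ones in the question:

```python
import re

# Made-up header in the question's format (an assumption for illustration).
text = ">Sequence 1.1.1 ATGCGCGCGATAAGGCGCTA"

# Capture the first two version components, then drop the trailing ".digit".
trimmed = re.sub(r'(\d+\.\d+)\.\d', r'\1', text)
print(trimmed)  # → >Sequence 1.1 ATGCGCGCGATAAGGCGCTA
```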

if line.startswith('>Sequence'):
    line = line[:-2]  # trim 2 characters from the end of the string
or if there could be more than one digit after the period:
if line.startswith('>Sequence'):
    dot_pos = line.rfind('.')  # find position of rightmost period
    line = line[:dot_pos]      # truncate up to but not including the dot
Edit, for the case where the sequence occurs on the same line as >Sequence:
If we know that there will always be only 1 digit to remove we can cut out the period and the digit with:
line = line[:13] + line[15:]
This is using a feature of Python called slices. The indexes are zero-based and exclusive for the end of the range so line[0:13] will give us the first 13 characters of line. Except that if we want to start at the beginning the 0 is optional so line[:13] does the same thing. Similarly line[15:] gives us the substring starting at character 15 to the end of the string.
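For instance, assuming a header shaped exactly like ">Sequence 1.1.1 ..." (so the ".1" to drop sits at indexes 13 and 14):

```python
line = ">Sequence 1.1.1 ATGCG"
# line[:13] keeps the first 13 characters (">Sequence 1.1"),
# line[15:] keeps everything from index 15 on, skipping the ".1".
print(line[:13] + line[15:])  # → >Sequence 1.1 ATGCG
```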

Apply '.'.join(line.split('.')[:-1]) to each header line of the file.
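A sketch of that per-line expression; note it drops everything after the last period, so it assumes the header line ends with the version string:

```python
# Split on periods, drop the last piece, and rejoin with periods.
line = ">Sequence 1.1.1"
print('.'.join(line.split('.')[:-1]))  # → >Sequence 1.1
```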

Here's a short script. Run it like: script [filename to clean]. Lots of error handling omitted.
It operates using generators, so it should work fine on huge files as well.
import sys
import os

def clean_line(line):
    if line.startswith(">"):
        return line.rstrip()[:-2]
    else:
        return line.rstrip()

def clean(input):
    for line in input:
        yield clean_line(line)

if __name__ == "__main__":
    filename = sys.argv[1]
    print("Cleaning %s; output to %s.." % (filename, filename + ".clean"))
    input = None
    output = None
    try:
        input = open(filename, "r")
        output = open(filename + ".clean", "w")
        for line in clean(input):
            output.write(line + os.linesep)
            print(": " + line)
    finally:
        if input is not None:
            input.close()
        if output is not None:
            output.close()

import re

input_file = open('in')
output_file = open('out', 'w')
for line in input_file:
    line = re.sub(r'(\d+[.]\d+)[.]\d+', r'\1', line)
    output_file.write(line)

Related

Python: To combine lines of a text file and skip certain records

I have an input file like below (please note that there may or may not be blank lines in the file):
11111*Author Name
22222*Date
11111 01 Var-1
11111 02 Var-2
11111 02 Var-3
Rules to be used:
If asterisk(*) is present at position # 6 of a record then skip the record.
First 6 bytes are sequence number which can be spaces as well. However, the first six bytes whether space or number can be ignored.
Only combine the records where asterisk is not present at position # 6.
Only consider data starting from position 7 in the input file up to position 72.
Add comma as shown below
Expected Output
01,Var-1,02,Var-2,02,Var-3
Below is the code I was trying to use to print the records. However, I was not able to get a comma (,) after each text item, and some items were prefixed with spaces. Can someone please help?
with open("D:/Desktop/Files/Myfile.txt", "r") as file_in:
    for lines in file_in:
        if "*" not in lines:
            lines_new = " ".join(lines.split())
            lines_fin = lines_new.replace(' ', ',')
            print(lines_fin, end=' ')
Assuming you just want to print them one after another (they will still be on separate lines)
with open("D:/Desktop/Files/Myfile.txt", "r") as file_in:
    for line in file_in:
        if line == "\n":  # skip empty lines
            continue
        if line[5] == "*":  # skip if asterisk at 6th position
            continue
        line = line[6:72].strip()  # ignore the 6-byte sequence number, cap at column 72
        line = line.replace(' ', ',')  # replace remaining spaces with commas
        print(line + ',')
If you just want them all combined then a better way to do it would be:
with open("D:/Desktop/Files/Myfile.txt", "r") as f:
    all_lines = f.readlines()
# Skip blanks and '*' records, slice off the 6-byte sequence number, cap at column 72
all_lines = [",".join(line[6:72].split()) for line in all_lines if line != "\n" and line[5] != "*"]
all_lines = ",".join(all_lines)
I haven't tested this so may have typos!
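A self-contained check of this approach, with the 6-byte sequence number sliced off per the rules and io.StringIO standing in for the real file:

```python
import io

# In-memory copy of the question's sample input.
sample = io.StringIO(
    "11111*Author Name\n"
    "22222*Date\n"
    "\n"
    "11111 01 Var-1\n"
    "11111 02 Var-2\n"
    "11111 02 Var-3\n"
)
all_lines = sample.readlines()
# Skip blanks and '*' records, drop the first 6 bytes, cap at column 72.
kept = [",".join(line[6:72].split())
        for line in all_lines
        if line != "\n" and line[5] != "*"]
print(",".join(kept))  # → 01,Var-1,02,Var-2,02,Var-3
```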
I think a regex solution would be elegant
You would need to handle the limit of 72 for the length of data, but that should not be a problem.
import re

pattern = r'[\s\d]{6}(.+)'
out = []
with open('combinestrings.txt', 'r') as infile:
    for line in infile:
        result = re.findall(pattern, line)
        if result:
            out.append(','.join(result[0].split(' ')))
print(','.join(out))
output:
01,Var-1,02,Var-2,02,Var-3
I would use Python's pathlib as it has some useful capabilities for handling paths and reading text files.
To join items together it is useful to put the items you want to join in a Python list and then use the join method on the list.
I have also changed the logic of how the data is split. When a line is kept, the first 6 characters are always removed, so these can be sliced off. Doing that first makes the whitespace split cleaner, as you get exactly the two items you are seeking.
There seemed to be a requirement to truncate the data if it was longer than 72 characters so I limited the line of data to 72 characters.
This is what my test code looked like:
from pathlib import Path

data_file = Path("D:/Desktop/Files/Myfile.txt")
field_size = 72

def combine_file_contents(filename):
    combined_data = []
    for line in filename.read_text().splitlines():
        if line and line[5] != "*":
            combined_data.extend(line[6:field_size].split())
    return ','.join(combined_data)

if __name__ == '__main__':
    expected_output = "01,Var-1,02,Var-2,02,Var-3"
    output_data = combine_file_contents(data_file)
    print("New Output: ", output_data)
    print("Expected Output:", expected_output)
    assert output_data == expected_output
This gave the following output when I ran with the test data from the question:
New Output: 01,Var-1,02,Var-2,02,Var-3
Expected Output: 01,Var-1,02,Var-2,02,Var-3

python fileinput find and replace line

I am trying to find a line that starts with a specific string and replace the entire line with a new string.
I tried this code
filename = "settings.txt"
for line in fileinput.input(filename, inplace=True):
    print(line.replace('BASE_URI =', 'BASE_URI = "http://example.net"'))
This only replaces the matching substring, not the entire line. What is the best way to replace an entire line that starts with a given string?
You don't need to know what old is; just redefine the entire line:
import sys
import fileinput

for line in fileinput.input([filename], inplace=True):
    if line.strip().startswith('BASE_URI ='):
        line = 'BASE_URI = "http://example.net"\n'
    sys.stdout.write(line)
Are you using Python 2 syntax? Since Python 2 is discontinued, I will solve this in Python 3 syntax.
Suppose you need to replace lines that start with "Hello" with "Not Found"; then you can do:
lines = open("settings.txt").readlines()
newlines = []
for line in lines:
    if not line.startswith("Hello"):
        newlines.append(line)
    else:
        newlines.append("Not Found\n")
with open("settings.txt", "w+") as fh:
    # lines from readlines() already end in "\n", so write them as-is
    for line in newlines:
        fh.write(line)
This should do the trick:
def replace_line(source, destination, starts_with, replacement):
    # Open file path
    with open(source) as s_file:
        # Store all file lines in lines
        lines = s_file.readlines()
    # Iterate lines
    for i in range(len(lines)):
        # If a line starts with given string
        if lines[i].startswith(starts_with):
            # Replace the whole line (append a newline, since replacement has none)
            lines[i] = replacement + "\n"
    # Open destination file and write modified lines list into it
    with open(destination, "w") as d_file:
        d_file.writelines(lines)
Call it using this parameters:
replace_line("settings.txt", "settings.txt", 'BASE_URI =', 'BASE_URI = "http://example.net"')
Cheers!

Getting the line number of a string

Suppose I have a very long string taken from a file:
lf = open(filename, 'r')
text = lf.read()  # read() gives one string; readlines() would give a list
lf.close()
or
lineList = [line.strip() for line in open(filename)]
text = '\n'.join(lineList)
How can one find the line number of a specific regular expression match in this string (in this case, the line number of 'match'):
regex = re.compile(somepattern)
for match in re.findall(regex, text):
continue
Thank you for your time in advance
Edit: Forgot to add that the pattern we are searching for spans multiple lines, and I am interested in the starting line.
We need to get re.Match objects rather than the strings themselves, using re.finditer, which allows getting information about the starting position. Consider the following example: let's say I want to find every pair of digits located immediately before and after a newline (\n); then:
import re

lineList = ["123", "456", "789", "ABC", "XYZ"]
text = '\n'.join(lineList)

for match in re.finditer(r"\d\n\d", text, re.MULTILINE):
    start = match.span()[0]  # .span() gives tuple (start, end)
    line_no = text[:start].count("\n")
    print(line_no)
Output:
0
1
Explanation: after getting the starting position, I simply count the number of newlines before that position, which equals the line number. Note: I assumed line numbers start from 0.
Perhaps something like this:
lf = open(filename, 'r')
text_lines = lf.readlines()
lf.close()

regex = re.compile(somepattern)
for line_number, line in enumerate(text_lines):
    for match in re.findall(regex, line):
        print('Match found on line %d: %s' % (line_number, match))

Match an element of every line

I have a list of rules for a given input file for my function. If any of them are violated in the file given, I want my program to return an error message and quit.
Every gene in the file should be on the same chromosome
Thus, for lines such as:
NM_001003443 chr11 + 5997152 5927598 5921052 5926098 1 5928752,5925972, 5927204,5396098,
NM_001003444 chr11 + 5925152 5926098 5925152 5926098 2 5925152,5925652, 5925404,5926098,
NM_001003489 chr11 + 5925145 5926093 5925115 5926045 4 5925151,5925762, 5987404,5908098,
etc.
Each line in the file will be a variation of these lines.
Thus, I want to make sure every line in the file is on chr11.
Yet I may be given a file with a different chr value (chr followed by any number of digits). Thus I want to write a function that will make sure that whatever number follows chr is the same on every line.
Should I use a regular expression for this, or what should I do? This is in Python, by the way.
Such as: chr\d+ ?
I am unsure how to make sure that whatever is matched is the same in every line though...
I currently have:
from re import *

for line in file:
    r = 'chr\d+'
    i = search(r, line)
    if i in line:
but I don't know how to make sure it is the same in every line...
In reference to sajattack's answer
fp = open(infile, 'r')
filestring = ''
for line in fp:
    filestring += line
chrlist = search('chr\d+', filestring)
chrlist = chrlist.group()
for chr in chrlist:
    if chr != chrlist[0]:
        print('Every gene in file not on same chromosome')
Just read the file and have a while loop check each line to make sure it contains chr11. There are string functions to search for substrings in a string. As soon as you find a line that returns false (does not contain chr11) then break out of the loop and set a flag valid = false.
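A minimal sketch of that loop, checked against in-memory lines rather than a real file (the data here is adapted from the question):

```python
lines = [
    "NM_001003443 chr11 + 5997152 5927598",
    "NM_001003444 chr11 + 5925152 5926098",
    "NM_001003489 chr12 + 5925145 5926093",  # different chromosome
]
valid = True
for line in lines:
    if "chr11" not in line:  # substring test; break on first mismatch
        valid = False
        break
print(valid)  # → False
```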
import re

fp = open(infile, 'r')
tar = re.findall(r'chr\d+', fp.readline())[0]  # chr value from the first line
for line in fp:
    if line.find(tar) == -1:
        print("Not valid")
        break
This should search for a number in the line and check for validity.
Is it safe to assume that the first chr is the correct one? If so, use this:
import re

chrlist = re.findall("chr[0-9]+", open('file').read())
# ^ this is a list with all chr(whatever numbers)
for chr in chrlist:
    if chr != chrlist[0]:
        print("Chr does not match")
        break
My solution uses a "match group" to collect the matched numbers from the "chr" string.
import re

pat = re.compile(r'\schr(\d+)\s')

def chr_val(line):
    m = re.search(pat, line)
    if m is not None:
        return m.group(1)
    else:
        return ''

def is_valid(f):
    line = f.readline()
    v = chr_val(line)
    if not v:
        return False
    return all(chr_val(line) == v for line in f)

with open("test.txt", "r") as f:
    print("The file is {0}".format("valid" if is_valid(f) else "NOT valid"))
Notes:
Pre-compiles the regular expression for speed.
Uses a raw string (r'') to specify the regular expression.
The pattern requires white space (\s) on either side of the chr string.
is_valid() returns False if the first line doesn't have a good chr value. Then it returns a Boolean value that is true if all of the following lines match the chr value of the first line.
Your sample code just prints something like The file is True so I made it a bit friendlier.

How can I use readline() to begin from the second line?

I'm writing a short program in Python that will read a FASTA file which is usually in this format:
>gi|253795547|ref|NC_012960.1| Candidatus Hodgkinia cicadicola Dsem chromosome, 52 lines
GACGGCTTGTTTGCGTGCGACGAGTTTAGGATTGCTCTTTTGCTAAGCTTGGGGGTTGCGCCCAAAGTGA
TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC
TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA
I've created another program that reads the first line (aka header) of this FASTA file, and now I want this second program to start reading and printing beginning from the sequence.
How would I do that?
So far I have this:
FASTA = open("test.txt", "r")

def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    for line in FASTA:
        line = line.strip()
        print(line)

readSeq(FASTA)
Thanks guys
-Noob
def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    _unused = next(FASTA)  # skip heading record
    for line in FASTA:
        line = line.strip()
        print(line)
Read the docs on file.next() to see why you should be wary of mixing file.readline() with for line in file:
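In Python 3, file objects no longer have a .next() method; the built-in next() does the same job. A sketch of the header-skipping idiom, with io.StringIO standing in for a real FASTA file:

```python
import io

# In-memory stand-in for a FASTA file (an assumption for illustration).
fasta = io.StringIO(">header line\nGACGGC\nTTAGAT\n")
next(fasta)  # consume the heading record
for line in fasta:
    print(line.strip())
```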
You should show your script. To read from the second line, something like this:
f = open("file")
f.readline()
for line in f:
    print(line)
f.close()
You might be interested in checking BioPython's handling of FASTA files (source).
def FastaIterator(handle, alphabet=single_letter_alphabet, title2ids=None):
    """Generator function to iterate over Fasta records (as SeqRecord objects).

    handle - input file
    alphabet - optional alphabet
    title2ids - A function that, when given the title of the FASTA
    file (without the beginning >), will return the id, name and
    description (in that order) for the record as a tuple of strings.
    If this is not given, then the entire title line will be used
    as the description, and the first word as the id and name.

    Note that use of title2ids matches that of Bio.Fasta.SequenceParser
    but the defaults are slightly different.
    """
    #Skip any text before the first record (e.g. blank lines, comments)
    while True:
        line = handle.readline()
        if line == "":
            return  #Premature end of file, or just empty?
        if line[0] == ">":
            break

    while True:
        if line[0] != ">":
            raise ValueError("Records in Fasta files should start with '>' character")
        if title2ids:
            id, name, descr = title2ids(line[1:].rstrip())
        else:
            descr = line[1:].rstrip()
            id = descr.split()[0]
            name = id

        lines = []
        line = handle.readline()
        while True:
            if not line:
                break
            if line[0] == ">":
                break
            #Remove trailing whitespace, and any internal spaces
            #(and any embedded \r which are possible in mangled files
            #when not opened in universal read lines mode)
            lines.append(line.rstrip().replace(" ", "").replace("\r", ""))
            line = handle.readline()

        #Return the record and then continue...
        yield SeqRecord(Seq("".join(lines), alphabet),
                        id=id, name=name, description=descr)

        if not line:
            return  #StopIteration
    assert False, "Should not reach this line"
Good to see another bioinformatician :)
Just include an if clause within your for loop, above the line.strip() call:
def readSeq(FASTA):
    for line in FASTA:
        if line.startswith('>'):
            continue
        line = line.strip()
        print(line)
A pythonic and simple way to do this would be slice notation.
>>> f = open('filename')
>>> lines = f.readlines()
>>> lines[1:]
['TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC\n', 'TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA']
That says "give me all elements of lines, from the second (index 1) to the end."
Other general uses of slice notation:
s[i:j] slice of s from i to j
s[i:j:k] slice of s from i to j with step k (k can be negative to go backward)
Either i or j can be omitted (to imply the beginning or the end), and j can be negative to indicate a number of elements from the end.
s[:-1] All but the last element.
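Those rules in a few concrete cases:

```python
s = ['a', 'b', 'c', 'd', 'e']
print(s[1:3])   # → ['b', 'c']
print(s[::2])   # → ['a', 'c', 'e']  (every other element)
print(s[:-1])   # → ['a', 'b', 'c', 'd']  (all but the last)
print(s[::-1])  # → ['e', 'd', 'c', 'b', 'a']  (negative step goes backward)
```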
Edit in response to gnibbler's comment:
If the file is truly massive you can use iterator slicing to get the same effect while making sure you don't get the whole thing in memory.
import itertools

f = open("filename")
# start at the second line, don't stop, stride by one
for line in itertools.islice(f, 1, None, 1):
    print(line)
"islicing" doesn't have the nice syntax or extra features of regular slicing, but it's a nice approach to remember.
