Separating a file by lines in python

I have a .fastq file (cannot use Biopython) that consists of multiple samples in different lines. The file contents look like this:
#sample1
ACGTC.....
+
IIIIDDDDDFF
#sample2
AGCGC....
+
IIIIIDFDFD
.
.
.
#sampleX
ACATAG
+
IIIIIDDDFFF
I want to take the file and separate out each individual set of samples (i.e. lines 1-4, 5-8 and so on until the end of the file) and write each of them to a separate file (i.e. sample1.fastq contains the contents of sample 1, lines 1-4, and so on). Is this doable using loops in python?

You can use defaultdict and regex for this:
import re
from collections import defaultdict

# Get file contents
with open("test.fastq", "r") as f:
    content = f.read()

samples = defaultdict(list)  # Make defaultdict of empty lists
identifier = ""
# Iterate through every line in file
for line in content.split("\n"):
    # Find strings which start with #
    if re.match("^#.*", line):
        # Set identifier to match following lines to this section
        identifier = line.replace("#", "")
    else:
        # Add the line to its identifier
        samples[identifier].append(line)
Now all you have to do is save the contents of this default dictionary into multiple files:
# Loop through all samples (and their contents)
for sample_name, sample_items in samples.items():
    # Create a new file with the name sample_name.fastq
    # (you might want to change the naming)
    with open(f"{sample_name}.fastq", "w") as f:
        # Write each element of sample_items on its own line
        f.write("\n".join(sample_items))
It might be helpful to also include the #sample_name line at the beginning of each file (as its first line), but I'm not sure you want that, so I haven't added it.
Note that you can tighten the regex to match only #sample[number] instead of everything starting with #; if you want that, use re.match(r"^#sample\d+", line) instead.
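If you do want that header line back, a minimal tweak to the writing loop above (same samples dict) could look like this:
# Sketch: restore the "#name" header as the first line of each output file
for sample_name, sample_items in samples.items():
    with open(f"{sample_name}.fastq", "w") as f:
        f.write(f"#{sample_name}\n")      # re-add the header line
        f.write("\n".join(sample_items))  # then the rest of the record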

Related

Python: Access "field" in line

I have the following .txt file (modified bash emboss-dreg report; the original report has seqtable format):
Start End Strand Pattern Sequence
43392 43420 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT
52037 52064 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC
188334 188360 + regex:[T][G][A][TC][C][CTG]\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC
I would like to access the elements under "Sequence" only, to compare them with some variables and delete whole lines if the comparison does not give the desired result (using Levenshtein distance for the comparison).
But I can't even get started .... :(
I am searching for something like the Linux -f option, to get directly to the right "field" in the line to do my comparison.
I came across re.split:
with open(textFile) as f:
    for line in f:
        cleaned = re.split(r'\t', line)
        print(cleaned)
which results in:
[' Start End Strand Pattern Sequence\n']
['\n']
[' 43392 43420 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCACGCCGAATGGAAACACGTTTT\n']
['\n']
[' 52037 52064 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGACCCTGCTTGGCGATCCCGGCGTTTC\n']
['\n']
[' 188334 188360 + regex:[T][G][A][TC][C][CTG]\\D{15,17}[CA][G][T][AT][AT][CTA] TGATCGCGCAACTGCAGCGGGAGTTAC\n']
['\n']
That is the closest I got to "splitting my lines into elements". I feel like I'm going totally the wrong way, but searching Stack Overflow and Google did not turn up anything :(
I have never worked with the seqtable format before, so I tried to deal with the file as plain .txt. Maybe there is another approach better suited to it?
Python is the main language I am learning; I am not so firm in Bash, but Bash answers for dealing with the issue would be OK for me, too.
I am thankful for any hint/link/help :)
The format itself seems to use blank lines as delimiters, while your r'\t' split is not doing anything useful: based on what you've pasted, the data is not tab-delimited in the first place, but padded with a variable number of spaces to line up the table.
To address both, you can read the file, treat the first line as a header (if you need it), then read the rest line by line, strip the leading/trailing whitespace, check if there is any data left and, if there is, further split it on whitespace to get to your line elements:
with open("your_data", "r") as f:
header = f.readline().split() # read the first line as a header
for line in f: # read the rest of the file line-by-line
line = line.strip() # first clear out the whitespace
if line: # check if there is any content left or is it an empty line
elements = line.split() # split the data on whitespace to get your elements
print(elements[-1]) # print the last element
TGATCGCACGCCGAATGGAAACACGTTTT
TGACCCTGCTTGGCGATCCCGGCGTTTC
TGATCGCGCAACTGCAGCGGGAGTTAC
As a bonus, since you have the header, you can turn it into a map and then use 'proxied' named access to get the element you're looking for so you don't need to worry about the element position:
with open("your_data", "r") as f:
# read the header and turn it into a value:index map
header = {v: i for i, v in enumerate(f.readline().split())}
for line in f: # read the rest of the file line-by-line
line = line.strip() # first clear out the whitespace
if line: # check if there is any content left or is it an empty line
elements = line.split()
print(elements[header["Sequence"]]) # print the Sequence element
You can also use a header map to turn your rows into dict structures for even easier access.
UPDATE: Here's how to create a header map and then use it to build a dict out of your lines:
with open("your_data", "r") as f:
# read the header and turn it into an index:value map
header = {i: v for i, v in enumerate(f.readline().split())}
for line in f: # read the rest of the file line-by-line
line = line.strip() # first clear out the whitespace
if line: # check if there is any content left or is it an empty line
# split the line, iterate over it and use the header map to create a dict
row = {header[i]: v for i, v in enumerate(line.split())}
print(row["Sequence"]) # ... or you can append it to a list for later use
As for how to 'delete' lines that you don't want for some reason, you'll have to create a temporary file, loop through your original file, compare your values, write the ones that you want to keep into the temporary file, delete the original file and finally rename the temporary file to match your original file, something like:
import shutil
from tempfile import NamedTemporaryFile

SOURCE_FILE = "your_data"  # path to the original file to process

def compare_func(seq):  # a simple comparison function for our sequence
    return not seq.endswith("TC")  # use Levenshtein distance or whatever you want instead

# open a temporary file for writing and our source file for reading
with NamedTemporaryFile(mode="w", delete=False) as t, open(SOURCE_FILE, "r") as f:
    header_line = f.readline()  # read the header
    t.write(header_line)  # write the header immediately to the temporary file
    header = {v: i for i, v in enumerate(header_line.split())}  # create a header map
    last_line = ""  # a var to store the whitespace to keep the same format
    for line in f:  # read the rest of the file line by line
        row = line.strip()  # first clear out the whitespace
        if row:  # check if there is any content left or if it is an empty line
            elements = row.split()  # split the row into elements
            # now let's call our comparison function
            if compare_func(elements[header["Sequence"]]):  # keep the line if True
                t.write(last_line)  # write down the last whitespace to the temporary file
                t.write(line)  # write down the current line to the temporary file
        else:
            last_line = line  # store the whitespace for later use
shutil.move(t.name, SOURCE_FILE)  # finally, overwrite the source with the temporary file
This will produce the same file sans the second row from your example, since its sequence ends in TC and our compare_func() returns False in that case.
For a bit less complexity, instead of using temporary files you can load your whole source file into working memory and then just overwrite it, but that would only work for files that fit in your working memory, while the above approach can work with files as large as your free storage space.
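For illustration, a minimal in-memory sketch, assuming the same compare_func and header layout as above (blank-line handling is simplified here):
# Sketch: filter in memory instead of via a temporary file
with open(SOURCE_FILE, "r") as f:
    lines = f.readlines()  # only viable if the file fits in RAM

header = {v: i for i, v in enumerate(lines[0].split())}
kept = [lines[0]]  # always keep the header line
for line in lines[1:]:
    elements = line.split()
    if not elements:  # keep blank spacer lines as-is
        kept.append(line)
    elif compare_func(elements[header["Sequence"]]):
        kept.append(line)  # keep rows that pass the comparison

with open(SOURCE_FILE, "w") as f:  # overwrite the source in place
    f.writelines(kept)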

Extracting fasta sequences in list files order

I need to extract some fasta sequences from the "goodProteins.fasta" file (first input) using the ID list files present in a separate folder (second input).
The format of the fasta sequence file is:
>1_12256
FSKVJLKDFJFDAKJQWERTYU......
>1_12257
SKJFHKDAJHLQWERTYGFDFHU......
>1_12258
QWERTYUHKDJKDJOKK......
>1_12259
DJHFDSQWERTYUHKDJKDJOKK......
>1_12260
ADKKHDFHJQWERTYUHKDJKDJOKK......
and the format of one of the ID files is:
1_12258
1_12256
1_12257
I'm using the following script:
from Bio import SeqIO
import glob

def process(wanted_file, result_file):
    fasta_file = "goodProteins.fasta"  # First input (fasta sequences)
    wanted = set()
    with open(wanted_file) as f:
        for line in f:
            line = line.strip()
            if line != "":
                wanted.add(line)
    fasta_sequences = SeqIO.parse(open(fasta_file), 'fasta')
    with open(result_file, "w") as f:
        for seq in fasta_sequences:
            if seq.id in wanted:
                SeqIO.write([seq], f, "fasta")

listFilesArr = glob.glob("My_folder\*txt")  # takes all .txt files as second input in My_folder
for wanted_file in listFilesArr:
    result_file = wanted_file[0:-4] + ".fasta"
    process(wanted_file, result_file)
It should extract the fasta sequences based on the information and order listed in the ID file, and the desired output would be:
>1_12258
QWERTYUHKDJKDJOKK......
>1_12256
FSKVJLKDFJFDAKJQWERTYU......
>1_12257
SKJFHKDAJHLQWERTYGFDFHU......
but I get:
>1_12256
FSKVJLKDFJFDAKJQWERTYU......
>1_12257
SKJFHKDAJHLQWERTYGFDFHU......
>1_12258
QWERTYUHKDJKDJOKK......
That is, in my final output the headers are sorted in ascending order, but I want them in exactly the same order as listed in the ID files. I'm not sure how to do it... please help.
I think the root cause of the ordering problem is that wanted is a set, and sets are unordered. Since you want the sequence IDs in the wanted_files to determine the ordering, you'd need to store them in something else that preserves order, like a list.
Alternatively, you can just process each line of the wanted_file as it's read. A problem with that approach is it would require you to potentially read through the "goodProteins.fasta" file many times — perhaps once for each line of the wanted_file if its contents aren't in a sorted order.
To avoid that, the entire file can be read once into a memory-resident dictionary whose keys are the sequence IDs, using the SeqIO.to_dict() function, and then reused for each wanted_file. You say the file is 50-60 MB, but that isn't too much for most of today's hardware.
Anyway, here's code that attempts to do this. To avoid global variables there's a Process class that reads in the "goodProteins.fasta" file and converts it into a dictionary when an instance of it is created. Instances are callable and reusable, meaning that the same process object can be used with each of the wanted_files without repeatedly reading the sequences file.
Note that the code is untested because I don't have the data files or the Bio module installed on my system — but hopefully it's close enough to help.
from Bio import SeqIO
import glob

class Process(object):
    def __init__(self, fasta_file_name):
        # read entire fasta file into memory as a dictionary indexed by ID
        with open(fasta_file_name, "rU") as fasta_file:
            self.fasta_sequences = SeqIO.to_dict(
                SeqIO.parse(fasta_file, 'fasta'))

    def __call__(self, wanted_file_name, results_file_name):
        with open(wanted_file_name, "rU") as wanted, \
             open(results_file_name, "w") as results:
            for seq_id in (line.strip() for line in wanted):
                if seq_id:
                    SeqIO.write(self.fasta_sequences[seq_id], results, "fasta")

process = Process("goodProteins.fasta")  # create process object

# process each wanted file using it
for wanted_file_name in glob.glob(r"My_folder\*.txt"):
    results_file_name = wanted_file_name[:-4] + ".fasta"
    process(wanted_file_name, results_file_name)

reading from a particular tuple onwards from a file in python

Using seek and tell is not functioning properly, as tell returns the current position in bytes; I need the line number rather than the file-pointer position to proceed.
I have a file glass.csv and I need to cluster the datasets. Each line in the file ends with a cluster number 1, 2, 3, ... like the below:
65,1.52172,13.48,3.74,0.90,72.01,0.18,9.61,0.00,0.07,1
66,1.52099,13.69,3.59,1.12,71.96,0.09,9.40,0.00,0.00,1
67,1.52152,13.05,3.65,0.87,72.22,0.19,9.85,0.00,0.17,1
68,1.52152,13.05,3.65,0.87,72.32,0.19,9.85,0.00,0.17,1
69,1.52152,13.12,3.58,0.90,72.20,0.23,9.82,0.00,0.16,1
70,1.52300,13.31,3.58,0.82,71.99,0.12,10.17,0.00,0.03,1
71,1.51574,14.86,3.67,1.74,71.87,0.16,7.36,0.00,0.12,2
72,1.51848,13.64,3.87,1.27,71.96,0.54,8.32,0.00,0.32,2
73,1.51593,13.09,3.59,1.52,73.10,0.67,7.83,0.00,0.00,2
74,1.51631,13.34,3.57,1.57,72.87,0.61,7.89,0.00,0.00,2
142,1.51851,13.20,3.63,1.07,72.83,0.57,8.41,0.09,0.17,2
143,1.51662,12.85,3.51,1.44,73.01,0.68,8.23,0.06,0.25,2
144,1.51709,13.00,3.47,1.79,72.72,0.66,8.18,0.00,0.00,2
145,1.51660,12.99,3.18,1.23,72.97,0.58,8.81,0.00,0.24,2
146,1.51839,12.85,3.67,1.24,72.57,0.62,8.68,0.00,0.35,2
147,1.51769,13.65,3.66,1.11,72.77,0.11,8.60,0.00,0.00,3
148,1.51610,13.33,3.53,1.34,72.67,0.56,8.33,0.00,0.00,3
149,1.51670,13.24,3.57,1.38,72.70,0.56,8.44,0.00,0.10,3
150,1.51643,12.16,3.52,1.35,72.89,0.57,8.53,0.00,0.00,3
I need to take some inputs from the tuples having 1 as the last number and save them in one file (train.txt), and the remaining ones in another file (test.txt). Likewise, I need to take certain lines from those having 2 as the last number and append them to the first file, i.e. train.txt, and the remaining ones to test.txt.
I cannot get the second input; it appends the first result instead.
The easiest way, assuming that you have a large file and cannot simply load the whole thing into memory, would be to use one file per category to do your sorting. If it is a small(ish) input file, then just load it as a comma-separated file using the csv module.
As a quick and dirty method (assuming smallish files):
data = []
with open('glass.csv', 'r') as infile:
    for line in infile:
        linedata = [float(val) for val in line.strip().split(',')]
        data.append(linedata)
adata = sorted(data, key=lambda items: items[-1])
## Then open both your output files and write them in the required fields.
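A sketch of that last step, assuming the train.txt/test.txt names from the question and the adata list from above (note the values were parsed as floats, so the output formatting will differ slightly from the input):
# Sketch: rows whose last value is 1 go to train.txt, the rest to test.txt
with open('train.txt', 'w') as train, open('test.txt', 'w') as test:
    for row in adata:
        out = train if row[-1] == 1 else test
        out.write(','.join(str(v) for v in row) + '\n')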
The default behavior for reading a text file is line by line, so you can just do something like this (note the output files need to be opened in write mode):
with open('input.csv', 'r') as f, open('output_1.csv', 'w') as output_1, open('output_2.csv', 'w') as output_2:
    for line in f:
        line_fields = line.strip().split(',')
        if line_fields[-1] == '1':
            output_1.write(line)
            continue
        if line_fields[-1] == '2':
            output_2.write(line)
Or you can use the csv module; it's much easier: https://docs.python.org/2/library/csv.html
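For reference, a minimal csv-based sketch of the same split (Python 3 open() arguments; the file names are the same placeholders as above):
import csv

# Sketch: route each row by its last column using the csv module
with open('input.csv', newline='') as f, \
        open('output_1.csv', 'w', newline='') as out1, \
        open('output_2.csv', 'w', newline='') as out2:
    writers = {'1': csv.writer(out1), '2': csv.writer(out2)}
    for row in csv.reader(f):
        if row and row[-1] in writers:
            writers[row[-1]].writerow(row)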

How can I append to the new line of a file while using write()?

In Python:
Let's say I have a loop, during each cycle of which I produce a list with the following format:
['n1','n2','n3']
After each cycle I would like to append the produced entry to a file (which contains all the outputs from the previous cycles). How can I do that?
Also, is there a way to make a list whose entries are the outputs of these cycles? I.e.
[[],[],[]] where each internal [] = ['n1','n2','n3'] etc.
Writing a single list as a line to a file
Surely you can write it into a file after converting it to a string:
with open('some_file.dat', 'w') as f:
    for x in xrange(10):  # assume 10 cycles
        line = []
        # ... (here is your code, appending data to line) ...
        f.write('%r\n' % line)  # here you write the representation to a separate line
Writing all lines at once
When it comes to the second part of your question:
Also, is there a way to make a list whose entries are the outputs of this cycle? i.e. [[],[],[]] where each internal []=['n1','n2','n3'] etc
it is also pretty basic. Assuming you want to save it all at once, just write:
lines = []  # container for a list of lines
for x in xrange(10):  # assume 10 cycles
    line = []
    # ... (here is your code, appending data to line) ...
    lines.append('%r\n' % line)  # here you add the line to the list of lines

# here "lines" is your list of cycle results
with open('some_file.dat', 'w') as f:
    f.writelines(lines)
Better way of writing a list to a file
Depending on what you need, you should probably use one of the more specialized formats rather than just a text file. Instead of writing list representations (which are okay, but not ideal), you could use e.g. the csv module (similar to Excel's spreadsheets): http://docs.python.org/3.3/library/csv.html
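For instance, a minimal csv sketch (Python 3; the 'n1', 'n2', 'n3' values stand in for whatever each cycle produces):
import csv

# Sketch: write each cycle's list as one row of a CSV file
with open('some_file.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for x in range(10):            # assume 10 cycles, as above
        line = ['n1', 'n2', 'n3']  # ... your cycle's output here ...
        writer.writerow(line)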
In f = open(file, 'a'), the first parameter is the path of the file and the second is the mode: 'a' is append, 'w' is write, 'r' is read, and so on.
In my opinion, you can use f.write(str(list) + '\n') to write a line in a loop; otherwise you can use f.writelines(list), which also works.
Hope this can help you:
lVals = []
with open(filename, 'a') as f:
    for x, y, z in zip(range(10), range(5, 15), range(10, 20)):
        lVals.append([x, y, z])
        f.write(str(lVals[-1]) + '\n')  # the '\n' puts each list on its own line

Add multiple sequences from a FASTA file to a list in python

I'm trying to organize a file with multiple sequences. In doing so, I'm trying to add the names to one list and the sequences to a separate list that is parallel with the name list. I figured out how to add the names to a list, but I can't figure out how to add the sequences that follow them into a separate list. I tried appending the lines of sequence to an empty string, but it appended all the lines of all the sequences into a single string.
All the names start with a '>'.
def Name_Organizer(FASTA, output):
    import os
    import re

    in_file = open(FASTA, 'r')
    dir, file = os.path.split(FASTA)
    temp = os.path.join(dir, output)
    out_file = open(temp, 'w')

    data = ''
    name_list = []

    for line in in_file:
        line = line.strip()
        for i in line:
            if i == '>':
                name_list.append(line)
                break
        else:
            line = line.upper()
            if all([k == k.upper() for k in line]):
                data = data + line

    print data
How do I add the sequences to a list as a set of strings?
the input file looks like this
If you're working with Python & fasta files, you might want to look into installing BioPython. It already contains this parsing functionality, and a whole lot more.
Parsing a fasta file would be as simple as this:
from Bio import SeqIO

for record in SeqIO.parse('filename.fasta', 'fasta'):
    print record.id, record.seq
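Building the two parallel lists the question asks for is then just a matter of collecting record.id and record.seq, roughly:
# Sketch: parallel name and sequence lists via BioPython
name_list = []
seq_list = []
for record in SeqIO.parse('filename.fasta', 'fasta'):
    name_list.append(record.id)
    seq_list.append(str(record.seq))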
You need to reset the string when you hit marker lines, like this:
def Name_Organizer(FASTA, output):
    import os
    import re

    in_file = open(FASTA, 'r')
    dir, file = os.path.split(FASTA)
    temp = os.path.join(dir, output)
    out_file = open(temp, 'w')

    data = ''
    name_list = []
    seq_list = []

    for line in in_file:
        line = line.strip()
        for i in line:
            if i == '>':
                name_list.append(line)
                if data:
                    seq_list.append(data)
                    data = ''
                break
        else:
            line = line.upper()
            if all([k == k.upper() for k in line]):
                data = data + line

    print seq_list
Of course, it might also be faster (depending on how large your files are) to use string joining rather than continually appending:
data = []
# ...
data.append(line) # repeatedly
# ...
seq_list.append(''.join(data)) # each time you get to a new marker line
data = []
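Put together, a hedged version of the parsing loop using list-joining (same variable names as above; note the extra flush after the loop so the final sequence isn't dropped):
data = []
name_list = []
seq_list = []
for line in in_file:
    line = line.strip()
    if line.startswith('>'):
        name_list.append(line)
        if data:
            seq_list.append(''.join(data))  # flush the previous sequence
            data = []
    elif line:
        data.append(line.upper())
if data:
    seq_list.append(''.join(data))  # flush the last sequence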
I organized it in a dictionary first:
import sys

# remove whitespace from the lines
lines = [x.strip() for x in open(sys.argv[1]).readlines()]

fasta = {}
for line in lines:
    if not line:
        continue
    # create the sequence name in the dict and a variable
    if line.startswith('>'):
        sname = line
        if line not in fasta:
            fasta[line] = ''
        continue
    # add the sequence to the last sequence name's entry
    fasta[sname] += line

# just to facilitate the input for my function
lst = list(fasta.values())
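If you then need the parallel name/sequence lists from the original question, they fall straight out of the dict (dicts preserve insertion order in Python 3.7+):
# Sketch: derive the parallel lists from the fasta dict
name_list = list(fasta.keys())
seq_list = list(fasta.values())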
