I have genomic data from 16 nuclei. The first column represents the nucleus, the next two columns represent the scaffold (section of genome) and the position on the scaffold respectively, and the last two columns represent the nucleotide and coverage respectively. The same scaffold and position can appear in different nuclei.
Using the input start and end positions (scaffold and position of each), I'm supposed to output a csv file which shows the data (nucleotide and coverage) of each nucleus within the range from start to end. I was thinking of doing this by having 16 columns (one for each nucleus), and then showing the data from top to bottom. The leftmost column would be the reference genome in that range, which I access through a dictionary built for each of its scaffolds.
In my code, I have a defaultdict of lists: the key is a string combining the scaffold and the location, and the value is a list of [nucleotide, coverage] pairs, one appended per nucleus, so that in the end each location holds data from every nucleus.
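To make that structure concrete, here is a small hedged illustration (the scaffold, positions and values are invented, not taken from the real file):
# Key: 'scaffold,position'; value: one [nucleotide, coverage] pair per nucleus
# that covered the position, appended in the order the nuclei are encountered.
locations = {
    'scaffold_41,51335': [['A', '12'], ['A', '7'], ['G', '3']],
    'scaffold_41,51336': [['C', '11'], ['C', '9']],
}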
Of course, this is very slow. How should I be doing it instead?
Code:
#let's plan this
#input is start and finish - when you hit first, add it and keep going until you hit next or larger
#dictionary of arrays
#loop through everything, output data for each nucleus
import csv
from collections import defaultdict
inrange = 0
start = 'scaffold_41,51335'
end = 'scaffold_41,51457'
locations = defaultdict(list)
count = 0
genome = defaultdict(str)
scaffold = ''
for line in open('Allpaths_SL1_corrected.fasta', 'r'):
    if line[0] == '>':
        scaffold = line[1:].rstrip()
    else:
        genome[scaffold] += line.rstrip()
print('Genome dictionary done.')
with open('automated.csv', 'rt') as read:
    for line in csv.reader(read, delimiter=','):
        if line[1] + ',' + line[2] == start:
            inrange = 1
        if inrange == 1:
            locations[line[1] + ',' + line[2]].append([line[3], line[4]])
        if line[1] + ',' + line[2] == end:
            inrange = 0
        count += 1
        if count % 1000000 == 0:
            print('Checkpoint ' + str(count) + '!')
with open('region.csv', 'w') as fp:
    wr = csv.writer(fp, delimiter=',', lineterminator='\n')
    for key in locations:
        nuclei = []
        for i in range(0, 16):
            try:
                nuclei.append(locations[key][i])
            except IndexError:
                nuclei.append(['', ''])
        wr.writerow([genome[key[0:key.index(',')]][int(key[key.index(',') + 1:]) - 1], key, nuclei])
print('Done!')
Files:
https://drive.google.com/file/d/0Bz7WGValdVR-bTdOcmdfRXpUYUE/view?usp=sharing
https://drive.google.com/file/d/0Bz7WGValdVR-aFdVVUtTbnI2WHM/view?usp=sharing
(Only focusing on the CSV section in the middle of your code)
The example csv file you supplied is over 2GB and 77,822,354 lines. Of those lines, you seem to only be focused on 26,804,253 lines or about 1/3.
As a general suggestion, you can speed things up by:
Avoid processing the data you are not interested in (2/3 of the file);
Speed up identifying the data you are interested in;
Avoid the things that are repeated millions of times and tend to be slow (processing each line as csv, reassembling a string, etc.);
Avoid reading all the data at once when you can break it up into blocks or lines (memory will get tight);
Use faster tools like numpy, pandas and PyPy (see the pandas sketch below).
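For example, a minimal pandas sketch (mine, not from the question; the column names and the assumption that position is a plain integer are guesses) that streams the file in chunks and keeps only the rows inside the requested block:
import pandas as pd

cols = ['nucleus', 'scaffold', 'position', 'nucleotide', 'coverage']  # assumed names
kept = []
for chunk in pd.read_csv('automated.csv', names=cols, chunksize=1000000):
    # Drop everything outside the requested block as early as possible.
    mask = (chunk['scaffold'] == 'scaffold_41') & chunk['position'].between(51335, 51457)
    kept.append(chunk[mask])
block = pd.concat(kept, ignore_index=True)
print(len(block))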
Your data is block oriented, so you can use a FlipFlop type object to sense whether you are in a block or not.
The first column of your csv is numeric, so rather than splitting the line apart and reassembling two columns, you can use the faster Python in operator to find the start and end of the blocks:
start = ',scaffold_41,51335,'
end = ',scaffold_41,51457,'

class FlipFlop:
    def __init__(self, start_pattern, end_pattern):
        self.patterns = start_pattern, end_pattern
        self.state = False

    def __call__(self, st):
        rtr = True if self.state else False
        if self.patterns[self.state] in st:
            self.state = not self.state
        return self.state or rtr

lines_in_block = 0
with open('automated.csv') as f:
    ff = FlipFlop(start, end)
    for lc, line in enumerate(f):
        if ff(line):
            lines_in_block += 1

print lines_in_block, lc
Prints:
26804256 77822354
That runs in about 9 seconds in PyPy and 46 seconds in Python 2.7.
You can then take the portion that reads the source csv file and turn that into a generator so you only need to deal with one block of data at a time.
(Certainly not correct, since I spent no time trying to understand your files overall...):
def csv_bloc(fn, start_pat, end_pat):
    from itertools import ifilter
    with open(fn) as csv_f:
        ff = FlipFlop(start_pat, end_pat)
        for block in ifilter(ff, csv_f):
            yield block
Or, if you need to combine all the blocks into one dict:
def csv_line(fn, start, end):
    with open(fn) as csv_in:
        ff = FlipFlop(start, end)
        for line in csv_in:
            if ff(line):
                yield line.rstrip().split(",")

di = {}
for row in csv_line('/tmp/automated.csv', start, end):
    di.setdefault((row[1], row[2]), []).append([row[3], row[4]])
That executes in about 1 minute on my (oldish) Mac in PyPy and about 3 minutes in cPython 2.7.
Related
I have a text file that looks like this
start_id=372
text: this is some text with
vartions in
white spacing and such
END_OF_RECORD
start_id=3453
text: Continued for
this record
that has other
variations in whitespacing
END_OF_RECORD
I need to convert this such that I can easily access the data with the preserved whitespacing and lines.
So something like this
result = function('start_id=3453')
result
returns
text: Continued for
this record
that has other
variations in whitespacing
The reason I need to preserve the whitespacing is because I need to look stuff by span. So
result[11:14]
results in
Con
Strategies I have thought up:
I have an algorithm that goes down the lines and searches if the line starts with 'start_id'. When I do, I go down the line until I reach end of line or whitespace, and record this span into a dictionary key.
Then I go down the lines until I hit 'END_OF_RECORD'.
I then somehow copy the whole line span into the dictionary value for that key.
My concerns about this method are whether there are any edge cases I am not thinking of, and how to copy several whole lines into a Python value.
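A minimal sketch of that strategy (the file name is hypothetical), mainly to show how several whole lines can be copied into one value: collect them in a list and join them, which preserves the line breaks and spacing needed for span lookups:
records = {}
current_id, current_lines = None, []
with open('records.txt') as f:  # hypothetical file name
    for line in f:
        if line.startswith('start_id='):
            current_id = line.rstrip().split('=', 1)[1]
            current_lines = []
        elif line.startswith('END_OF_RECORD'):
            # Joining the collected lines keeps whitespace and newlines intact.
            records[current_id] = ''.join(current_lines)
            current_id = None
        elif current_id is not None:
            current_lines.append(line)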
That should actually be quite simple if you use regex... something like this:
import re

dictionary_of_records = {}  # records will be stored here
recording = False  # this prevents starting a new record while recording, and holds the id
# matchers
start = re.compile(r'start_id=(\d*)')
end = re.compile(r'END_OF_RECORD')

with open('stack.txt') as file:
    for line in file.readlines():
        if start.match(line):
            if recording:
                raise Exception('Attempting to create new record without ending previous one!')
            print('Start... matched!')
            recording = start.search(line).group(1)  # save the id of the recording
            print(f'Starting record with id {recording}')
            current_record_string = ''  # make an empty string to save the recording to
        elif end.match(line):
            print('End... matched!')
            dictionary_of_records[recording] = current_record_string  # save the entry to the dict
            recording = False  # reset recording to False
            continue
        elif not recording:
            continue
        else:
            current_record_string += line

print(dictionary_of_records)
IIUC:
data = {}
for line in open('/content/deidentified-medical-text-1.0/id.text').readlines():
    if line.startswith('START_OF_RECORD='):
        id_ = line.strip().split('=')[1]
        lines = []
    elif line.startswith('||||END_OF_RECORD'):
        data[id_] = ''.join(lines)
        id_ = None
        lines = []
    elif id_:
        lines.append(line)
>>> data
{372: 'text: this is some text with\nvartions in\n\n\n white spacing and such\n',
3453: 'text: Continued for\n\nthis record\n that has other\n\n\nvariations in whitespacing\n'}
>>> data[3453][11:14]
'Con'
This is the CSV in question. I got this data from Extra History, which is a video series that talks about many historical topics through many mini-series, such as 'Rome: The Punic Wars' or 'Europe: The First Crusades' (in the CSV). Episodes in these mini-series are numbered #1 up to #6 (though Justinian Theodora has two series, the first numbered #1-#6, the second #7-#12).
I would like to do some statistical analysis on these mini-series, which entails sorting the numbered episodes (e.g. episodes #2-#6) into their appropriate series, i.e. the end result would look something like this; I can then easily automate sorting into the appropriate Python list.
My Python code matches the #2-#6 episodes to the #1 episode correctly 99% of the time, with only 1 big error in red and 1 slight error in yellow (because the first episode of that series is #7, not #1). However, I get the nagging feeling that there is an easier and foolproof way, since the strings are well organized, with regular patterns in their names. Is that possible? And can I achieve that with my current code, or should I approach it from a different angle?
import csv
eh_csv = '/Users/Work/Desktop/Extra History Playlist Video Data.csv'
with open(eh_csv, newline='', encoding='UTF-8') as f:
    reader = csv.reader(f)
    data = list(reader)

import re
series_first = []
series_rest = []
singles_music_lies = []
all_episodes = []  # list of the names of all episodes

# separates all videos into 3 non-overlapping lists: first episodes of a series,
# the other numbered episodes of the series, and singles/music/lies videos
for video in data:
    all_episodes.append(video[0])
    # need regex b/c a normal string search for #1 also matches (Justinian &) Theodora #10
    if len(re.findall('\\b1\\b', video[0])) == 1:
        series_first.append(video[0])
    elif '#' not in video[0]:
        singles_music_lies.append(video[0])
    else:
        series_rest.append(video[0])

# Dice's Coefficient
# got from here; John Rutledge's answer with NinjaMeTimbers' modification
# https://stackoverflow.com/questions/653157/a-better-similarity-ranking-algorithm-for-variable-length-strings
# ------------------------------------------------------------------------------------------
def get_bigrams(string):
    """
    Take a string and return a list of bigrams.
    """
    s = string.lower()
    return [s[i:i+2] for i in list(range(len(s) - 1))]

def string_similarity(str1, str2):
    """
    Perform bigram comparison between two strings
    and return a percentage match in decimal form.
    """
    pairs1 = get_bigrams(str1)
    pairs2 = get_bigrams(str2)
    union = len(pairs1) + len(pairs2)
    hit_count = 0
    for x in pairs1:
        for y in pairs2:
            if x == y:
                hit_count += 1
                pairs2.remove(y)
                break
    return (2.0 * hit_count) / union
# -------------------------------------------------------------------------------------------

# only take the first couple of words of the episodes' names for comparison, b/c the first couple of words are
# 99% of the time the name of the series; can't make it too short or words like 'the, of' etc. will get matched
# (now or in future), or too long because that will increase the chance of a superfluous match; does much better
# than w/o limiting to the first few words
def first_three_words(name_string):
    # e.g. ''.join vs ' '.join
    first_three = ' '.join(name_string.split()[:5])  # --> 'The Haitian Revolution', slightly worse
    # first_three = ''.join(name_string.split()[:5])  # --> 'TheHaitianRevolution', slightly better
    return first_three

# compare a given episode with all first videos, and return a list of comparison scores
def compared_with_first(episode, series_name=series_first):
    episode_scores = []
    for i in series_name:
        x = first_three_words(episode)
        y = first_three_words(i)
        # comparison_score = round(string_similarity(episode, i), 4)
        comparison_score = round(string_similarity(x, y), 4)
        episode_scores.append((comparison_score, i))
    return episode_scores

matches = []
# go through video number 2, 3, 4 etc. in a series and compare them with the first episode
# of all series, then get a comparison score
for episode in series_rest:
    scores_list = compared_with_first(episode)
    similarity_score = 0
    most_likely_match = []
    # go thru the list of comparison scores returned from compared_with_first,
    # then append the current episode/highest score/first episode to
    # most_likely_match; repeat for all non-first episodes
    for score in scores_list:
        if score[0] > similarity_score:
            similarity_score = score[0]
            most_likely_match.clear()  # MIGHT HAVE BEEN THE CRUCIAL KEY
            most_likely_match.append((episode, score))
    matches.append(most_likely_match)

final_match = []
for i in matches:
    final_match.append((i[0][0], i[0][1][1], i[0][1][0]))

# just to get output in the desired presentation
path = '/Users/Work/Desktop/'
with open('EH Sorting Episodes.csv', 'w', newline='', encoding='UTF-8') as csvfile:
    csvwriter = csv.writer(csvfile)
    for currentRow in final_match:
        csvwriter.writerow(currentRow)
        # print(currentRow)
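(Not part of the question: regarding the 'regular patterns in their names' idea above, here is a hedged sketch of grouping episodes by the title text left after stripping the '#N' token. It assumes every numbered episode contains a '#<number>' and that what remains identifies the series, which the red/yellow cases mentioned above may violate.)
import re
from collections import defaultdict

def series_key(title):
    # Drop the '#<number>' token and surrounding spaces; what remains is taken
    # to be the series name (an assumption, not something verified on the CSV).
    return re.sub(r'\s*#\d+\b', '', title).strip()

series = defaultdict(list)
for episode in all_episodes:  # all_episodes as built in the code above
    series[series_key(episode)].append(episode)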
I wrote a script to transform a large 4MB text file with 40k+ lines of unordered data into a specifically formatted, easier-to-deal-with CSV file.
Problem:
Analyzing my file sizes, it appears I've lost over 1MB of data (20K lines; edit: the original file was 7MB, so ~4MB of data lost), and when I attempt to search for specific data points present in CommaOnly.txt in sorted_CSV.csv, I cannot find them.
I found this really weird.
What I tried:
I searched for and replaced all the unicode chars present in CommaOnly.txt that might be causing a problem. No luck!
Example: \u0b99 replaced with " "
Here's an example of some data loss
A line from: CommaOnly.txt
name,SJ Photography,category,Professional Services,
state,none,city,none,country,none,about,
Capturing intimate & milestone moment from pregnancy and family portraits to weddings
Sorted_CSV.csv
Not present.
What could be causing this?
Code:
import re
import csv
import time

# Final Sorted Order for all data:
# ['name', 'data',
#  'category', 'data',
#  'about', 'data',
#  'country', 'data',
#  'state', 'data',
#  'city', 'data']

## Receives a string item, splits on the "," delimiter, returns the split list
def split_values(string):
    string = string.strip('\n')
    split_string = re.split(',', string)
    return split_string

## Iterates through the list, reorganizes terms in the desired order at the desired indices
## Adds the field if it is not initially present
def reformo_sort(list_to_sort):
    processed_values = [""] * 12
    for i in range(11):
        try:
            ## Terrible code I know, but trying to be explicit for the question
            if (i == 0):
                for j in range(len(list_to_sort)):
                    if (list_to_sort[j] == "name"):
                        processed_values[0] = (list_to_sort[j])
                        processed_values[1] = (list_to_sort[j + 1])  ## append its neighbour
                ## if, after iterating, name does not appear, add it.
                if (processed_values[0] != "name"):
                    processed_values[0] = "name"
                    processed_values[1] = "None"
            elif (i == 2):
                for j in range(len(list_to_sort)):
                    if (list_to_sort[j] == "category"):
                        processed_values[2] = (list_to_sort[j])
                        processed_values[3] = (list_to_sort[j + 1])
                if (processed_values[2] != "category"):
                    processed_values[2] = "category"
                    processed_values[3] = "None"
            elif (i == 4):
                for j in range(len(list_to_sort)):
                    if (list_to_sort[j] == "about"):
                        processed_values[4] = (list_to_sort[j])
                        processed_values[5] = (list_to_sort[j + 1])
                if (processed_values[4] != "about"):
                    processed_values[4] = "about"
                    processed_values[5] = "None"
            elif (i == 6):
                for j in range(len(list_to_sort)):
                    if (list_to_sort[j] == "country"):
                        processed_values[6] = (list_to_sort[j])
                        processed_values[7] = (list_to_sort[j + 1])
                if (processed_values[6] != "country"):
                    processed_values[6] = "country"
                    processed_values[7] = "None"
            elif (i == 8):
                for j in range(len(list_to_sort)):
                    if (list_to_sort[j] == "state"):
                        processed_values[8] = (list_to_sort[j])
                        processed_values[9] = (list_to_sort[j + 1])
                if (processed_values[8] != "state"):
                    processed_values[8] = "state"
                    processed_values[9] = "None"
            elif (i == 10):
                for j in range(len(list_to_sort)):
                    if (list_to_sort[j] == "city"):
                        processed_values[10] = (list_to_sort[j])
                        processed_values[11] = (list_to_sort[j + 1])
                if (processed_values[10] != "city"):
                    processed_values[10] = "city"
                    processed_values[11] = "None"
        except:
            print("failed to append!")
    return processed_values

# Converts the desired data fields to a string, delimiting values by ','
def to_CSV(values_to_convert):
    CSV_ENTRY = str(values_to_convert[1]) + ',' + str(values_to_convert[3]) + ',' + str(values_to_convert[5]) + ',' + str(values_to_convert[7]) + ',' + str(values_to_convert[9]) + ',' + str(values_to_convert[11])
    return CSV_ENTRY

with open("CommaOnly.txt", 'r') as c:
    print("Starting.. :)")
    for line in c:
        entry = c.readline()
        to_sort = split_values(entry)
        now_sorted = reformo_sort(to_sort)
        CSV_ROW = to_CSV(now_sorted)
        with open("sorted_CSV.csv", "a+") as file:
            file.write(str(CSV_ROW) + "\n")
    print("Finished! :)")
    time.sleep(60)
I've rewritten the main loop that seemed fishy to me, using the csv package. The original loop is also the likely cause of the missing data: "for line in c" reads one line and then "entry = c.readline()" immediately reads the next, so only every other line gets processed, which matches losing roughly half the file.
Your reformo_sort routine is incomplete and syntactically incorrect, with empty elif blocks and missing processing, so I got incomplete lines, but this should work much better than your code. Note the usage of csv, the "binary" flag, the single open in write mode instead of opening/closing the output for each line (much faster), and the 1-out-of-2 filtering of the now_sorted array.
with open("CommaOnly.txt", 'rb') as c:
print("Starting.. :)")
cr = csv.reader(c,delimiter=",",quotechar='"')
with open("sorted_CSV.csv", "wb") as fw:
cw = csv.writer(fw,delimiter=",",quotechar='"')
for to_sort in cr:
now_sorted = reformo_sort(to_sort)
cw.writerow(now_sorted[1::2])
I have a big fasta file in this format:
>gi|142022655|gb|EQ086233.1|522 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
AAGACGGGCACCGTGTCCTTCGCGACGTACTCCGACCAGTTGTACACGTTCAGGTTGGTGTCGCCGGCAT
GGGCCGACAGGCTGGCCGCGACGGCCAGCGCCGCCGACGTGACGCGCGCGGCGCGCAACGCCGATTGACG
ACGGATACGGATACGCATGGGGATTCTCCTTGTGATGGGGATCGGCCGTTGCGCCCGGTCCGGGTCCGGA
CTCGCGTCAACGCCGTCGAGCGGTGTTCAGCACAAGGGCCAATGTAGAGATCGCGGCCGGCAGCGTCAGT
CCCGAAAACCGGGACAAACGGCGACGTCGATTCCCGCCGTTTGGGTAGATTCCCGCGTAGGCAGTCGAAA
ATATTCGTGATACCTGTAGCGCCACCTGAAAATCTTCGATACACGACGCCATGAGCGCTGCGCTGCCCGC
CCCCGATCTTCCGCTGAGCCACGTCGCGTTCGTGACTGAAACGCTGGGCGACATCGCACAAGCCGTCGGA
ACGCCGCAGTTCATGCGCGCCGTCTACGACACGCTCGTGCGCTACGTCGATTTCGACGCCGTGCACCTCG
ACTACGAGCGCAGCGCGTCTTCCGGCCGGCGCAGCGTCGGCTGGATCGGCAGCTTCGGCCGCGAGCCCGA
GCTGGTCGCGCAGGTGATGCGCCACTACTACCGCAGCTACGCGAGCGACGATGCAACTTACGCGGCGATC
GAAACCGAAAACGACGTGCAATTGCTGCAGGTGTCCGCGCAACGCGTGTCGAGCGAGCTACGGCATCTGT
TCTTCGATGCCGGCGACATTCATGACGAATGCGTGATCGCCGGCGTGACGGGCGGCACGCGCTACTCGAT
CTCGATCGCGCGCTCACGGCGGCTGCCGCCGTTTTCGCTGAAGGAACTGAGCCTGCTGAAGCAGCTTTCG
CAAGTCGTGCTGCCGCTGGCGTCCGCGCACAAGCGCCTGCTCGGCGCGATCTCCGCCGACGACGCACCGC
GCGACGAACTCGATCTCGACCTCGTCGCGCAATGGCTGCCGGAATGGCAGGAACGGTTGACCGCGCGCGA
GATGCATGTGTGTGCGTCGTTCATCCAGGGCATGACGTCGGCGGCCATCGCCCAATCGATGGGGCTCAAG
ACCTCCACCGTCGATACCTACGCGAAGCGCGCCTTCGCGAAGCTCGGCGTCGATTCGCGAAGGCAACTGA
TGACCCTCGTGCTGAGAAACGCGTCGCGGCGGCATGACGCATAGCATCC
>gi|142022655|gb|EQ086233.1|598 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
TTGCCGCCGGCCGCAGCCGGCTTGGCACCACGCTGCGGCTGGTCGCCGGACTTCGGCTTCGCGCCGGTGT
CCGCCGGCGCTGCCGGCCGCTTCGCGTTGCGCTCCTGCTTGGCCTTCGCTGCGAGCTGCGCCCGCAATTC
GGCAAGTTGTTCAAAACCCATAAATTCAATCCACCAGGAATATAAGGTGTGGTTCGTGCGGCCATGCCGC
GCGGCGCACGAGCTTCGCCGCCATGCGTGCGACCCGTCTGCCGCCGATGCGGAATACTACGGGGCCGCAT
>gi|142022655|gb|EQ086233.1|143 marine metagenome JCVI_SCAF_1096627390048 genomic scaffold, whole genome shotgun sequence
CTGATGCGTGCGCGCGGCCGCCTGCAGCCAGCGCGTCAGTTCCGGCGCCGCCGCGCGGCTGTAGTTCAGCGCG
CCGCCGCGATCGACGGGCAGGTAATGGCCTTCGATGTCGATGCCGTCCGGCGGCGTGTTCGAGTTCGCGA
TCGAGCCGAACTTGCCGGTCTTGCGCGCCTCGACGTACGTGCCGTCGTCGACGTACTGGATCTTCAGGTC
GACGCCGAGCCGCTGCCGCGCCTGCGCCTGCAGCGCCTGCAGCAGCACGTCGCGCTGGTCGCGCACGGTC
I want to find out the length of the longest open reading frame (ORF) appearing in reading frame 3 of any of the sequences.
So far, I have tried some code that lists out all the ORFs of one sequence, inputted as a string:
import re
from string import maketrans

pattern = re.compile(r'(?=(ATG(?:...)*?)(?=TAG|TGA|TAA))')

def revcomp(dna_seq):
    return dna_seq[::-1].translate(maketrans("ATGC", "TACG"))

def orfs(dna):
    return set(pattern.findall(dna) + pattern.findall(revcomp(dna)))

print orfs(Seq)
where Seq='''CTGATGCGTGCGCGCGGCCGCCTGCAGCCAGCGCGTCAGTTCCGGCGCCGCCGCGCGGCTGTAGTTCAGCGCGCCGCCGCGATCGACGGGCAGGTAATGGCCTTCGATGTCGATGCCGTCCGGCGGCGTGTTCGAGTTCGCGATCGAGCCGAACTTGCCGGTCTTGCGCGCCTCGACGTACGTGCCGTCGTCGACGTACTGGATCTTCAGGTCGACGCCGAGCCGCTGCCGCGCCTGCGCCTGCAGCGCCTGCAGCAGCACGTCGCGCTGGTCGCGCACGGTC''' Notice that this is the 3rd entry in the big fasta file format above.
My sample output to this is: set([]), so I am clearly doing something terribly wrong. My code doesn't even scale to multiple entries (i.e., it only takes a single DNA string, called Seq)
Can anyone point me in the right direction please?
EDIT:
N.B.: ATG is the start codon (i.e., the beginning of an ORF) and TAG, TGA, and TAA are stop codons (i.e., the end of an ORF).
EDITED 1: Completely rewritten to better match problem description.
I don't know the exact file format here, so am assuming it carries on the same way as the three sequences you show -- one sequence after another.
If I understand correctly, the reason you didn't see a match in the third sequence is that there actually isn't a match there. There are matches in the first two, though, and you will see them if you run this.
import re
import string

with open('dna.txt', 'rb') as f:
    data = f.read()

data = [x.split('\n', 1) for x in data.split('>')]
data = [(x[0], ''.join(x[1].split())) for x in data if len(x) == 2]

start, end = [re.compile(x) for x in 'ATG TAG|TGA|TAA'.split()]
revtrans = string.maketrans("ATGC", "TACG")

def get_longest(starts, ends):
    ''' Simple brute-force for now.  Optimize later...
        Given a list of start locations and a list
        of end locations, return the longest valid
        string.  Returns tuple (length, start position)

        Assume starts and ends are sorted correctly
        from beginning to end of string.
    '''
    results = {}
    # Use the smallest end that is bigger than each start
    ends.reverse()
    for start in starts:
        for end in ends:
            if end > start and (end - start) % 3 == 0:
                results[start] = end + 3
    results = [(end - start, start) for
               start, end in results.iteritems()]
    return max(results) if results else (0, 0)

def get_orfs(dna):
    ''' Returns length, header, forward/reverse indication,
        and longest match (corrected if reversed)
    '''
    header, seqf = dna
    seqr = seqf[::-1].translate(revtrans)

    def readgroup(seq, group):
        return list(x.start() for x in group.finditer(seq))

    f = get_longest(readgroup(seqf, start), readgroup(seqf, end))
    r = get_longest(readgroup(seqr, start), readgroup(seqr, end))
    (length, index), s, direction = max((f, seqf, 'forward'), (r, seqr, 'reverse'))
    return length, header, direction, s[index:index + length]

# Process the entire file
all_orfs = [get_orfs(x) for x in data]

# Put in groups of 3
all_orfs = zip(all_orfs[::3], all_orfs[1::3], all_orfs[2::3])

# Process each group of 3
for x in all_orfs:
    x = max(x)  # Only print the longest in each group
    print(x)
    print('')
Your requirements aren't clear to me. In your question you say "I want to be able to find out the length of the longest open reading frame (ORF) appearing in reading frame 3 of any of the sequences?" In subsequent comments, you say "I'm only interested in the comprehensive calculation of each ORF from any arbitrary reading frame. They are ALL important to me."
Assuming that the latter is what you're after, here's a simple way to get all the ORFs from a collection of sequences in fasta format, using BioPython to look after much of the work.
import io  # Only needed because the input is in string form
from Bio import Seq, SeqIO
import regex as re

startP = re.compile('ATG')

def get_orfs(nuc):
    orfs = []
    for m in startP.finditer(nuc, overlapped=True):
        pro = Seq.Seq(nuc)[m.start():].translate(to_stop=True)
        orfs.append(nuc[m.start():m.start() + len(pro) * 3 + 3])
    return orfs

for fasta in SeqIO.parse(io.StringIO(fasta_inputs), 'fasta'):
    header = fasta.description
    orfs = get_orfs(str(fasta.seq))
    print(header, orfs)
Notes:
Normally you'd read a fasta collection from a file. Since here it's in string format, we used io.StringIO to make it easily compatible with SeqIO.parse from BioPython
The get_orfs function finds ATGs and returns the ORF originating from each one. If you're also interested in frames 4 through 6, you'll need the reverse_complement of the sequence.
If you're only interested in the longest ORF from each fasta sequence, you could have the get_orfs function return (max(orfs, key=len))
It's marginally more difficult if you're only interested in ORFs starting with ATG in a specific frame (e.g. frame 3). The simplest approach there might be to simply find all ORFs from the frame and then discard those not starting with ATG.
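For illustration, a hedged sketch of one way to get that effect (my own, on the assumption that 'frame 3' means ATG starts at 0-based offsets 2, 5, 8, ... on the forward strand): keep only the ATG-anchored ORFs whose start falls in that frame, reusing startP and Seq from the code above.
def get_orfs_in_frame3(nuc):
    orfs = []
    for m in startP.finditer(nuc, overlapped=True):
        if m.start() % 3 != 2:  # keep only starts that fall in reading frame 3
            continue
        pro = Seq.Seq(nuc)[m.start():].translate(to_stop=True)
        orfs.append(nuc[m.start():m.start() + len(pro) * 3 + 3])
    return orfs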
I have this data from sequencing a bacterial community.
I know some basic Python and am in the midst of completing the codecademy tutorial.
For practical purposes, please think of OTU as another word for "species"
Here is an example of the raw data:
OTU ID OTU Sum Lineage
591820 1083 k__Bacteria; p__Fusobacteria; c__Fusobacteria; o__Fusobacteriales; f__Fusobacteriaceae; g__u114; s__
532752 517 k__Bacteria; p__Fusobacteria; c__Fusobacteria; o__Fusobacteriales; f__Fusobacteriaceae; g__u114; s__
218456 346 k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Alcaligenaceae; g__Bordetella; s__
590248 330 k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Alcaligenaceae; g__; s__
343284 321 k__Bacteria; p__Proteobacteria; c__Betaproteobacteria; o__Burkholderiales; f__Comamonadaceae; g__Limnohabitans; s__
The data includes three things: a reference number for the species, how many times that species is in the sample, and the taxonomy of said species.
What I'm trying to do is add up all the times a sequence is found for a taxonomic family (designated as f__x in the data).
Here is an example of the desired output:
f__Fusobacteriaceae 1600
f__Alcaligenaceae 676
f__Comamonadaceae 321
This isn't for a class. I started learning python a few months ago, so I'm at least capable of looking up any suggestions. I know how it works out from doing it the slow way (copy & paste in excel), so this is for future reference.
If the lines in your file really look like this, you can do
from collections import defaultdict
import re

nums = defaultdict(int)
with open("file.txt") as f:
    for line in f:
        items = line.split(None, 2)  # Split twice on any whitespace
        if items[0].isdigit():
            key = re.search(r"f__\w+", items[2]).group(0)
            nums[key] += int(items[1])
Result:
>>> nums
defaultdict(<type 'int'>, {'f__Comamonadaceae': 321, 'f__Fusobacteriaceae': 1600,
'f__Alcaligenaceae': 676})
Yet another solution, using collections.Counter:
from collections import Counter

counter = Counter()

with open('data.txt') as f:
    # Skip the header line
    next(f)
    for line in f:
        # Strip the line of extraneous whitespace
        line = line.strip()
        # Only process non-empty lines
        if line:
            # Split by consecutive whitespace, into 3 chunks (2 splits)
            otu_id, otu_sum, lineage = line.split(None, 2)
            # Split the lineage tree into a list of nodes
            lineage = [node.strip() for node in lineage.split(';')]
            # Extract the family node (assuming there's only one)
            family = [node for node in lineage if node.startswith('f__')][0]
            # Increase the count for this family by `otu_sum`
            counter[family] += int(otu_sum)

for family, count in counter.items():
    print "%s %s" % (family, count)
See the docs for str.split() for the details of the None argument (matching consecutive whitespace).
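For instance:
line = "591820 1083 k__Bacteria; p__Fusobacteria; f__Fusobacteriaceae; g__u114; s__"
# sep=None with maxsplit=2: split on runs of whitespace, at most twice,
# so the whole lineage stays together as the third item.
print(line.split(None, 2))
# ['591820', '1083', 'k__Bacteria; p__Fusobacteria; f__Fusobacteriaceae; g__u114; s__']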
Get all your raw data and process it first; that is, structure it, and then use the structured data for whatever operations you need.
If you have GBs of data, you can use Elasticsearch: feed it your raw data, then query for the string you need (in this case f__*), get all the matching entries, and add them up.
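As a hedged illustration of the 'structure it first' idea (using pandas rather than Elasticsearch; the tab separator and the column names are assumptions about the file layout):
import pandas as pd

# Load the table (skipping the header row), pull the family tag out of the
# lineage column, then sum the OTU counts per family.
df = pd.read_csv('otu_table.txt', sep='\t', names=['otu_id', 'otu_sum', 'lineage'], skiprows=1)
df['family'] = df['lineage'].str.extract(r'(f__\w+)', expand=False)
print(df.groupby('family')['otu_sum'].sum())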
That's very doable with basic python. Keep the Library Reference under your pillow, as you'll want to refer to it often.
You'll likely end up doing something similar to this (I'll write it the longer, more readable way; there are ways to compress the code and do this more quickly).
# Open up a file handle
file_handle = open('myfile.txt')

# Discard the header line
file_handle.readline()

# Make a dictionary to store sums
sums = {}

# Loop through the rest of the lines
for line in file_handle.readlines():
    # Strip off the pesky newline at the end of each line.
    line = line.strip()

    # Put each white-space delimited ... whatever ... into items of a list.
    line_parts = line.split()

    # Get the first column
    reference_number = line_parts[0]

    # Get the second column, convert it to an integer
    sum = int(line_parts[1])

    # Loop through the taxonomies (the rest of the 'columns' separated by whitespace)
    for taxonomy in line_parts[2:]:
        # Skip it if it doesn't start with 'f_'
        if not taxonomy.startswith('f_'):
            continue

        # Remove the pesky semi-colon
        taxonomy = taxonomy.strip(';')

        if taxonomy in sums:
            sums[taxonomy] += int(sum)
        else:
            sums[taxonomy] = int(sum)

# All done, do some fancy reporting. We'll leave sorting as an exercise to the reader.
for taxonomy in sums.keys():
    print("%s %d" % (taxonomy, sums[taxonomy]))