For-loop to count differences in lines with Python

I have a file filled with lines like this (this is just a small bit of the file):
9 Hyphomicrobium facile Hyphomicrobiaceae
9 Hyphomicrobium facile Hyphomicrobiaceae
7 Mycobacterium kansasii Mycobacteriaceae
7 Mycobacterium gastri Mycobacteriaceae
10 Streptomyces olivaceiscleroticus Streptomycetaceae
10 Streptomyces niger Streptomycetaceae
1 Streptomyces geysiriensis Streptomycetaceae
1 Streptomyces minutiscleroticus Streptomycetaceae
0 Brucella neotomae Brucellaceae
0 Brucella melitensis Brucellaceae
2 Mycobacterium phocaicum Mycobacteriaceae
The number refers to a cluster, and then it goes 'Genus' 'Species' 'Family'.
What I want to do is write a program that will look through each line and report back to me: a list of the different genera in each cluster, and how many of each of those genera are in the cluster. So I'm interested in cluster number and the first 'word' in each line.
My trouble is that I'm not sure how to get this information. I think I need to use a for-loop, starting at lines that begin with '0'. The output would be a file that looks something like:
Cluster 0: Brucella(2) # Lists cluster, followed by genera in cluster with number, something like that.
Cluster 1: Streptomyces(2)
Cluster 2: Brucella(1)
etc.
Eventually I want to do the same kind of count with the Families in each cluster, and then Genera and Species together. Any thoughts on how to start would be greatly appreciated!

I thought this would be a fun little toy project, so I wrote a little hack to read in an input file like yours from stdin, count and format the output recursively and spit out output that looks a little like yours, but with a nested format, like so:
Cluster 0:
    Brucella(2)
        melitensis(1)
            Brucellaceae(1)
        neotomae(1)
            Brucellaceae(1)
    Streptomyces(1)
        neotomae(1)
            Brucellaceae(1)
Cluster 1:
    Streptomyces(2)
        geysiriensis(1)
            Streptomycetaceae(1)
        minutiscleroticus(1)
            Streptomycetaceae(1)
Cluster 2:
    Mycobacterium(1)
        phocaicum(1)
            Mycobacteriaceae(1)
Cluster 7:
    Mycobacterium(2)
        gastri(1)
            Mycobacteriaceae(1)
        kansasii(1)
            Mycobacteriaceae(1)
Cluster 9:
    Hyphomicrobium(2)
        facile(2)
            Hyphomicrobiaceae(2)
Cluster 10:
    Streptomyces(2)
        niger(1)
            Streptomycetaceae(1)
        olivaceiscleroticus(1)
            Streptomycetaceae(1)
I also added some junk data to my table so that I could see an extra entry in Cluster 0, separated from the other two... The idea here is that you should be able to see a top level "Cluster" entry and then nested, indented entries for genus, species, family... it shouldn't be hard to extend for deeper trees, either, I hope.
# Sys for stdio stuff
import sys
# re for the re.split -- this can go if you find another way to parse your data
import re

# A global (shame on me) for storing the data we're going to parse from stdin
data = []

# read lines from standard input until it's empty (end-of-file)
for line in sys.stdin:
    # Split lines on spaces (gobbling multiple spaces for robustness)
    # and trim whitespace off the beginning and end of input (strip)
    entry = re.split(r"\s+", line.strip())
    # Throw the array into my global data array, it'll look like this:
    # [ "0", "Brucella", "melitensis", "Brucellaceae" ]
    # A lot of this code assumes that the first field is an integer, what
    # you call "cluster" in your problem description
    data.append(entry)

# Sort, first key is expected to be an integer, and we want a numerical
# sort rather than a string sort, so convert to int, then sort by
# each subsequent column. The lambda is a function that returns a tuple
# of keys we care about for each item
data.sort(key=lambda item: (int(item[0]), item[1], item[2], item[3]))

# Our recursive function -- we're basically going to treat "data" as a tree,
# even though it's not.
# parameters:
# start - an integer telling us what line to begin working from so we needn't
#         walk the whole tree each time to figure out where we are.
# super - An array that captures where we are in the search. This array
#         will have more elements in it as we deepen our traversal of the "tree"
#         Initially, it will be []
#         In the next ply of the tree, it will be [ '0' ]
#         Then something like [ '0', 'Brucella' ] and so on.
# data  - The global data structure -- this never mutates after the sort above,
#         I could have just used the global directly
def groupedReport(start, super, data):
    # Figure out what ply we're on in our depth-first traversal of the tree
    depth = len(super)
    # Count entries in the super class, starting from "start" index in the array:
    count = 0
    # For the few records in the data file that match our "super" exactly, we count
    # occurrences.
    if depth != 0:
        for i in range(start, len(data)):
            if (data[i][0:depth] == data[start][0:depth]):
                count = count + 1
            else:
                break  # We can stop counting as soon as a match fails,
                       # because of the way our input data is sorted
    else:
        count = len(data)

    # At depth == 1, we're reporting about clusters -- this is the only piece of
    # the algorithm that's not truly abstract, and it's only for presentation
    if (depth == 1):
        sys.stdout.write("Cluster " + super[0] + ":\n")
    elif (depth > 0):
        # Every other depth: indent with 4 spaces for every ply of depth, then
        # output the unique field we just counted, and its count
        sys.stdout.write((' ' * ((depth - 1) * 4)) +
                         data[start][depth - 1] + '(' + str(count) + ')\n')

    # Recursion: we're going to figure out a new depth and a new "super"
    # and then call ourselves again. We break out on depth == 4 because
    # of one other assumption (I lied before about the abstract thing) I'm
    # making about our input data here. This could
    # be made more robust/flexible without a lot of work
    subsuper = None
    substart = start
    for i in range(start, start + count):
        record = data[i]  # The original record from our data
        newdepth = depth + 1
        if (newdepth > 4): break
        # array slice creates a new copy
        newsuper = record[0:newdepth]
        if newsuper != subsuper:
            # Recursion here!
            groupedReport(substart, newsuper, data)
            # Track our new "subsuper" for subsequent comparisons
            # as we loop through matches
            subsuper = newsuper
        # Track position in the data for next recursion, so we can start on
        # the right line
        substart = substart + 1

# First call to groupedReport starts the recursion
groupedReport(0, [], data)
If you make my Python code into a file like "classifier.py", then you can run your input.txt file (or whatever you call it) through it like so:
cat input.txt | python classifier.py
Most of the magic of the recursion, if you care, is implemented using slices of arrays and leans heavily on the ability to compare array slices, as well as the fact that I can order the input data meaningfully with my sort routine. You may want to convert your input data to all-lowercase, if it is possible that case inconsistencies could yield mismatches.
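If case inconsistencies are a real possibility (an assumption on my part, your sample data is consistently cased), one small tweak is to normalise while parsing:

    # lowercase every field as it is read, so 'Brucella' and 'brucella' compare equal
    entry = re.split(r"\s+", line.strip().lower())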

This is easy to do:
Create an empty dict {} to store your result; let's call it 'result'.
Loop over the data line by line.
Split each line on spaces to get 4 elements as per your structure: cluster, genus, species, family.
Increment the count of the genus inside each cluster key when it is found in the current loop; it has to be set to 1 for the first occurrence, though.
result = { '0': { 'Brucella': 2}, '1': {'Streptomyces': 2}, ... }
Code:
my_data = """9 Hyphomicrobium facile Hyphomicrobiaceae
9 Hyphomicrobium facile Hyphomicrobiaceae
7 Mycobacterium kansasii Mycobacteriaceae
7 Mycobacterium gastri Mycobacteriaceae
10 Streptomyces olivaceiscleroticus Streptomycetaceae
10 Streptomyces niger Streptomycetaceae
1 Streptomyces geysiriensis Streptomycetaceae
1 Streptomyces minutiscleroticus Streptomycetaceae
0 Brucella neotomae Brucellaceae
0 Brucella melitensis Brucellaceae
2 Mycobacterium phocaicum Mycobacteriaceae"""
result = {}
for line in my_data.split("\n"):
    cluster, genus, species, family = line.split(" ")
    result.setdefault(cluster, {}).setdefault(genus, 0)
    result[cluster][genus] += 1
print(result)
{'10': {'Streptomyces': 2}, '1': {'Streptomyces': 2}, '0': {'Brucella': 2}, '2': {'Mycobacterium': 1}, '7': {'Mycobacterium': 2}, '9': {'Hyphomicrobium': 2}}
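To get output shaped like the lines in the question ("Cluster 0: Brucella(2)"), a small follow-up sketch on top of that result dict (sorting the clusters numerically is my assumption about the desired order; the same setdefault pattern works for families by keying on family instead of genus):

for cluster in sorted(result, key=int):
    genera = ", ".join("%s(%d)" % (genus, count) for genus, count in sorted(result[cluster].items()))
    print("Cluster %s: %s" % (cluster, genera))
# Cluster 0: Brucella(2)
# Cluster 1: Streptomyces(2)
# Cluster 2: Mycobacterium(1)
# ...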

Related

Count occurrences of a specific string within multi-valued elements in a set

I have generated a list of genes
genes = ['geneName1', 'geneName2', ...]
and a set of their interactions:
geneInt = {('geneName1', 'geneName2'), ('geneName1', 'geneName3'),...}
I want to find out how many interactions each gene has and put that in a vector (or dictionary) but I struggle to count them. I tried the usual approach:
interactionList = []
for gene in genes:
    interactions = geneInt.count(gene)
    interactionList.append(interactions)
but of course the code fails because my set contains elements that are made out of two values while I need to iterate over the single values within.
I would argue that you are using the wrong data structure to hold interactions. You can represent interactions as a dictionary keyed by gene name, whose values are a set of all the genes it interacts with.
Let's say you currently have a process that does something like this at some point:
geneInt = set()
...
geneInt.add((gene1, gene2))
Change it to
geneInt = collections.defaultdict(set)
...
geneInt[gene1].add(gene2)
If the interactions are symmetrical, add a line
geneInt[gene2].add(gene1)
Now, to count the number of interactions, you can do something like
intCounts = {gene: len(ints) for gene, ints in geneInt.items()}
Counting your original list is simple if the interactions are one-way as well:
intCounts = dict.fromkeys(genes, 0)
for gene, _ in geneInt:
    intCounts[gene] += 1
If each interaction is two-way, there are three possibilities:
Both interactions are represented in the set: the above loop will work.
Only one interaction of a pair is represented: change the loop to
for gene1, gene2 in geneInt:
    intCounts[gene1] += 1
    if gene1 != gene2:
        intCounts[gene2] += 1
Some reverse interactions are represented, some are not. In this case, transform geneInt into a dictionary of sets as shown in the beginning.
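A minimal sketch of that transformation (assuming the interactions should be treated as symmetrical):

import collections

geneIntDict = collections.defaultdict(set)
for gene1, gene2 in geneInt:
    geneIntDict[gene1].add(gene2)
    geneIntDict[gene2].add(gene1)  # mirror the pair so both directions are counted

intCounts = {gene: len(ints) for gene, ints in geneIntDict.items()}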
Try something like this:
interactions = {}
for gene in genes:
    interactions_count = 0
    for tup in geneInt:
        interactions_count += tup.count(gene)
    interactions[gene] = interactions_count
Use a dictionary, and keep incrementing the value for every gene you see in each tuple in the set geneInt.
interactions_counter = dict()
for interaction in geneInt:
    for gene in interaction:
        interactions_counter[gene] = interactions_counter.get(gene, 0) + 1
The dict.get(key, default) method returns the value at the given key, or the specified default if the key doesn't exist.
For the set geneInt={('geneName1', 'geneName2'), ('geneName1', 'geneName3')}, we get:
interactions_counter = {'geneName1': 2, 'geneName2': 1, 'geneName3': 1}
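The same count can also be written with collections.Counter, which packages up the dict.get pattern above:

from collections import Counter

interactions_counter = Counter(gene for interaction in geneInt for gene in interaction)
# Counter({'geneName1': 2, 'geneName2': 1, 'geneName3': 1})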

create table of contents in python

Assuming the following text:
# Algorithms
This chapter covers the most basic algorithms.
## Sorting
Quicksort is fast and widely used in practice
Merge sort is a deterministic algorithm
## Searching
DFS and BFS are widely used graph searching algorithms
# Data Structures
more text
## more data structures
How would I create a table of contents in Python such that if a line starts with #, it is replaced with 1. for the first #, and if a line starts with ##, it is replaced with 1.1.? The second time a # is seen in the text, it needs to be replaced with 2., and so on:
1. Algorithms
1.1. Sorting
1.2. Searching
2. Data Structures
2.1 more data structures
I would begin by doing something like:
for line in text:
    if line.startswith('#'):
        ....
but I don't know how to proceed.
You could do something like this:
from collections import defaultdict

levels = defaultdict(int)
max_level = 0
index = []

for line in text:
    if not line.startswith("#"):
        continue
    hashes, caption = line.rstrip().split(maxsplit=1)
    level = len(hashes)
    max_level = max(level, max_level)
    levels[level] += 1
    for l in range(level + 1, max_level + 1):
        levels[l] = 0
    index.append(
        ".".join(str(levels[l]) for l in range(1, level + 1))
        + ". " + caption
    )
Some explanations:
levels stores the current state of the counter of the table of contents. It's a defaultdict(int) because its values are ints and that makes it easy to use.
max_level stores the maximal depth of the counter so far. (E.g.: in the step from only single #s to the first ##, the maximal depth increases (1->2).)
If a line of the text starts with #, then it is split into 2 parts: 1. the hashes and 2. the caption.
The number of hashes (len(hashes)) is the level-depth of the current entry. Just in case it increases the maximum depth so far, max_level gets an update (often doing nothing).
Then the counter for the current level is increased by 1, and all the counters for the levels beyond the current one get a reset to 0. (E.g.: if the last state of the counter was 1.2 and then the first level gets increased (#), not only must the counter switch the first place (1->2), but the second level also needs a reset (2->0).) Another option would be to delete those levels.
The output in the join contains all the individual counters up to the current level (e.g., to get 1.3. instead of 1.3.0.0.0.0. etc.).
Result for
from io import StringIO
text = StringIO(
'''
# Algorithms
This chapter covers the most basic algorithms.
## Sorting
Quicksort is fast and widely used in practice
Merge sort is a deterministic algorithm
## Searching
DFS and BFS are widely used graph searching algorithms
# Data Structures
more text
## more data structures
''')
is
['1. Algorithms',
'1.1. Sorting',
'1.2. Searching',
'2. Data Structures',
'2.1. more data structures']
Another, similar approach, without defaultdict, would be:
levels = []
index = []

for line in text:
    if not line.startswith("#"):
        continue
    hashes, caption = line.rstrip().split(maxsplit=1)
    level = len(hashes)
    if level > len(levels):
        levels.append(1)
    else:
        levels[level - 1] += 1
        levels = levels[:level]
    index.append(
        ".".join(str(l) for l in levels) + ". " + caption
    )
You can use recursion:
from collections import defaultdict

def to_table(d, p = []):
    r, c, l = defaultdict(list), 0, None
    for a, *b in d:
        if a != '#' and p:
            yield f'{".".join(map(str, p))} {"".join(b)}'
        elif a == '#':
            r[l:=((c:=c+1) if b[0] != '#' else c)].append(''.join(b))
    yield from [j for a, b in r.items() for j in to_table(b, p+[a])]
s = """
# Algorithms
This chapter covers the most basic algorithms.
## Sorting
Quicksort is fast and widely used in practice
Merge sort is a deterministic algorithm
## Searching
DFS and BFS are widely used graph searching algorithms
# Data Structures
more text
## more data structures
"""
print('\n'.join(to_table(list(filter(None, s.split('\n'))))))
Output:
1 Algorithms
1.1 Sorting
1.2 Searching
2 Data Structures
2.1 more data structures
You can keep track of the top-level number and the sub-level number, and substitute the single hashes with the top-level number and the double hashes with the format top.sub; a sketch of that idea follows.
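A minimal sketch of that two-counter idea (it assumes only # and ## levels, as in the example text; deeper nesting would need one of the approaches above):

top, sub = 0, 0
toc = []
for line in text:
    if line.startswith("## "):
        sub += 1
        toc.append("%d.%d. %s" % (top, sub, line[3:].rstrip()))
    elif line.startswith("# "):
        top += 1
        sub = 0
        toc.append("%d. %s" % (top, line[2:].rstrip()))
print("\n".join(toc))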

Python Linear Search Better Efficiency

I've got a question regarding Linear Searching in Python. Say I've got the base code of
for l in lines:
    for f in search_data:
        if my_search_function(l[1],[f[0],f[2]]):
            print "Found it!"
            break
in which we want to determine where in search_data the value stored in l[1] exists. Say my_search_function() looks like this:
def my_search_function(search_key, search_values):
    for s in search_values:
        if search_key in s:
            return True
    return False
Is there any way to increase the speed of processing? Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes. I've tried an outside-in approach, i.e.
for line in lines:
    negative_index = -1
    positive_index = 0
    middle_element = len(search_data) /2 if len(search_data) %2 == 0 else (len(search_data)-1) /2
    found = False
    while positive_index < middle_element:
        # print str(positive_index)+","+str(negative_index)
        if my_search_function(line[1], [search_data[positive_index][0],search_data[negative_index][0]]):
            print "Found it!"
            break
        positive_index = positive_index +1
        negative_index = negative_index -1
However, I'm not seeing any speed increases from this. Does anyone have a better approach? I'm looking to cut the processing time in half, as I'm working with large amounts of CSV and the processing time for one file is over 00:15, which is unacceptable when I'm processing batches of 30+ files. Basically, the data I'm searching on is essentially SKUs. A value from lines[0] could be something like AS123JK, and a valid match for that value could be AS123. So a HashMap would not work here, unless there exists a way to do partial matches in a HashMap lookup that wouldn't require me breaking the values down like ['AS123', 'AS123J', 'AS123JK'], which is not ideal in this scenario. Thanks!
Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes.
Regardless, it may be worth your while to extract the strings (along with some reference to the original data structure) into a flat list, sort it, and perform fast binary searches on it with help of the bisect module.
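A rough sketch of that flat-list idea with bisect (the names flat, keys and prefix_match are mine for illustration, and it assumes the match you want is "some search_data string is a prefix of the line value"):

import bisect

# pull the searchable strings out of search_data together with their row index
flat = sorted((row[0], idx) for idx, row in enumerate(search_data))
keys = [k for k, _ in flat]

def prefix_match(value):
    # any prefix of `value` sorts at or before `value`, so scan backwards
    # from the insertion point over the nearby candidates only
    pos = bisect.bisect_right(keys, value)
    for j in range(pos - 1, -1, -1):
        if value.startswith(keys[j]):
            return flat[j][1]          # index of the matching search_data row
        if keys[j][:1] != value[:1]:   # left the block of possible prefixes
            return None
    return None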
Or, instead of a large number of searches, sort also a combined list of all the search keys and traverse both lists in parallel, looking for matches. (Proceeding in a similar manner to the merge step in merge sort, without actually outputting a merged list)
Code to illustrate the second approach:
lines = ['AS12', 'AS123', 'AS123J', 'AS123JK','AS124']
search_keys = ['AS123', 'AS125']

try:
    iter_keys = iter(sorted(search_keys))
    key = next(iter_keys)
    for line in sorted(lines):
        if line.startswith(key):
            print('Line {} matches {}'.format(line, key))
        else:
            while key < line[:len(key)]:
                key = next(iter_keys)
except StopIteration: # all keys processed
    pass
It depends on the problem details.
For instance, if you search for complete words, you could create a hashtable on the searchable elements, and the final search would be a simple lookup.
Filling the hashtable is pseudo-linear.
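A minimal sketch of that hashtable idea, valid only under the "complete words" assumption (i.e. exact matches; the variable names mirror the question):

# build the lookup once: value -> row indices in search_data
index = {}
for i, row in enumerate(search_data):
    for value in (row[0], row[2]):          # the two searched columns
        index.setdefault(value, []).append(i)

# each probe is then an average O(1) dictionary hit
for l in lines:
    if l[1] in index:
        print "Found it!"                   # matching rows: [search_data[i] for i in index[l[1]]]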
Ultimately, I broke down and implemented binary search on my multidimensional lists by sorting with the sorted() function, using a lambda as the key argument. Here is the first-pass code that I whipped up. It's not 100% efficient, but it's a vast improvement from where we were.
def binary_search(master_row, source_data, master_search_index, source_search_index):
    lower_bound = 0
    upper_bound = len(source_data) - 1
    found = False
    while lower_bound <= upper_bound and not found:
        middle_pos = (lower_bound + upper_bound) // 2
        if source_data[middle_pos][source_search_index] < master_row[master_search_index]:
            if search([source_data[middle_pos][source_search_index]], [master_row[master_search_index]]):
                return {"result": True, "index": middle_pos}
            lower_bound = middle_pos + 1
        elif source_data[middle_pos][source_search_index] > master_row[master_search_index]:
            if search([master_row[master_search_index]], [source_data[middle_pos][source_search_index]]):
                return {"result": True, "index": middle_pos}
            upper_bound = middle_pos - 1
        else:
            if len(source_data[middle_pos][source_search_index]) > 5:
                return {"result": True, "index": middle_pos}
            else:
                break
    return {"result": False, "index": -1}  # no match found anywhere in source_data
and then, where we actually make the binary search call:
# where master_copy is the first multidimensional list, data_copy is the second
# the search columns are the columns we want to search against
for line in master_copy:
    for m in master_search_columns:
        found = False
        for d in data_search_columns:
            data_copy = sorted(data_copy, key=lambda x: x[d], reverse=False)
            results = binary_search(line, data_copy, m, d)
            found = results["result"]
            if found:
                line = update_row(line, data_copy[results["index"]], column_mapping)
                found_count = found_count + 1
                break
        if found:
            break
Here's the info for sorting a multidimensional list: Python Sort Multidimensional Array Based on 2nd Element of Subarray

Beyond for-looping: high performance parsing of a large, well formatted data file

I am looking to optimize the performance of a big data parsing problem I have using python. In case anyone is interested: the data shown below is segments of whole genome DNA sequence alignments for six primate species.
Currently, the best way I know how to proceed with this type of problem is to open each of my ~250 (size 20-50MB) files, loop through line by line and extract the data I want. The formatting (shown in examples) is fairly regular although there are important changes at each 10-100 thousand line segment. Looping works fine but it is slow.
I have been using numpy recently for processing massive (>10 GB) numerical data sets and I am really impressed at how quickly I am able to perform different computations on arrays. I wonder if there are some high-powered solutions for processing formatted text that circumvents tedious for-looping?
My files contain multiple segments with the pattern
<MULTI-LINE HEADER> # number of header lines mirrors number of data columns
<DATA BEGIN FLAG> # the word 'DATA'
<DATA COLUMNS> # variable number of columns
<DATA END FLAG> # the pattern '//'
<EMPTY LINE>
Example:
# key to the header fields:
# header_flag chromosome segment_start segment_end quality_flag chromosome_data
SEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)
SEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)
SEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)
SEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)
SEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)
SEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)
DATA
GGGGGG
CCCCTC
...... # continue for 10-100 thousand lines
//
SEQ homo_sapiens 1 11345717 11361846 1 (chr_length=249250621)
SEQ pan_troglodytes 1 11474525 11490638 1 (chr_length=229974691)
SEQ gorilla_gorilla 1 11562256 11579393 1 (chr_length=229966203)
SEQ pongo_pygmaeus 1 219047970 219064053 -1 (chr_length=229942017)
DATA
CCCC
GGGG
.... # continue for 10-100 thousand lines
//
<ETC>
I will use segments where the species homo_sapiens and macaca_mulatta are both present in the header, and field 6, which I called the quality flag in the comments above, equals '1' for each species. Since macaca_mulatta does not appear in the second example, I would ignore this segment completely.
I care about segment_start and segment_end coordinates for homo_sapiens only, so in segments where homo_sapiens is present, I will record these fields and use them as keys to a dict(). segment_start also tells me the first positional coordinate for homo_sapiens, which increases strictly by 1 for each line of data in the current segment.
I want to compare the letters (DNA bases) for homo_sapiens and macaca_mulatta. The header lines where homo_sapiens and macaca_mulatta appear (i.e. 1 and 5 in the first example) correspond to the columns of data representing their respective sequences.
Importantly, these columns are not always the same, so I need to check the header to get the correct indices for each segment, and to check that both species are even in the current segment.
Looking at the two lines of data in example 1, the relevant information for me is
# homo_sapiens_coordinate homo_sapiens_base macaca_mulatta_base
11388669 G G
11388670 C T
For each segment containing info for homo_sapiens and macaca_mulatta, I will record start and end for homo_sapiens from the header and each position where the two DO NOT match into a list. Finally, some positions have "gaps" or lower quality data, i.e.
aaa--A
I will only record from positions where homo_sapiens and macaca_mulatta both have valid bases (must be in the set ACGT) so the last variable I consider is a counter of valid bases per segment.
My final data structure for a given file is a dictionary which looks like this:
{(segment_start=i, segment_end=j, valid_bases=N): list(mismatch positions),
(segment_start=k, segment_end=l, valid_bases=M): list(mismatch positions), ...}
Here is the function I have written to carry this out using a for-loop:
def human_macaque_divergence(chromosome):
    """
    A function for finding the positions of human-macaque divergent sites within segments of species alignment tracts
    :param chromosome: chromosome (integer)
    :return div_dict: a dictionary with tuple(segment_start, segment_end, valid_bases_in_segment) for keys and list(divergent_sites) for values
    """
    ch = str(chromosome)
    div_dict = {}
    with gz.open('{al}Compara.6_primates_EPO.chr{c}_1.emf.gz'.format(al=pd.align, c=ch), 'rb') as f:

        # key to the header fields:
        # header_flag chromosome segment_start segment_end quality_flag chromosome_info
        # SEQ homo_sapiens 1 14163 24841 1 (chr_length=249250621)

        # flags, containers, counters and indices:
        species = []
        starts = []
        ends = []
        mismatch = []
        valid = 0
        pos = -1
        hom = None
        mac = None
        species_data = False  # a flag signalling that the lines we are viewing are alignment columns

        for line in f:

            if 'SEQ' in line:  # 'SEQ' signifies a segment info field
                assert species_data is False
                line = line.split()
                if line[2] == ch and line[5] == '1':  # make sure the alignment is to the desired chromosome in humans and quality_flag is '1'
                    species += [line[1]]  # collect each species in the header
                    starts += [int(line[3])]  # collect starts and ends
                    ends += [int(line[4])]

            if 'DATA' in line and {'homo_sapiens', 'macaca_mulatta'}.issubset(species):
                species_data = True
                # get the indices to scan in data columns:
                hom = species.index('homo_sapiens')
                mac = species.index('macaca_mulatta')
                pos = starts[hom]  # first homo_sapiens positional coordinate
                continue

            if species_data and '//' not in line:
                assert pos > 0
                # record the relevant bases:
                human = line[hom]
                macaque = line[mac]
                if {human, macaque}.issubset(bases):
                    valid += 1
                if human != macaque and {human, macaque}.issubset(bases):
                    mismatch += [pos]
                pos += 1

            elif species_data and '//' in line:  # '//' signifies segment boundary
                # store segment results if a boundary has been reached and data has been collected for the last segment:
                div_dict[(starts[hom], ends[hom], valid)] = mismatch
                # reset flags, containers, counters and indices
                species = []
                starts = []
                ends = []
                mismatch = []
                valid = 0
                pos = -1
                hom = None
                mac = None
                species_data = False

            elif not species_data and '//' in line:
                # reset flags, containers, counters and indices
                species = []
                starts = []
                ends = []
                pos = -1
                hom = None
                mac = None

    return div_dict
This code works fine (perhaps it could use some tweaking), but my real question is whether or not there might be a faster way to pull this data without running the for-loop and examining each line? For example, loading the whole file using f.read() takes less than a second although it creates a pretty complicated string. (In principle, I assume that I could use regular expressions to parse at least some of the data, such as the header info, but I'm not sure if this would necessarily increase performance without some bulk method to process each data column in each segment).
Does anyone have any suggestions as to how I circumvent looping through billions of lines and parse this kind of text file in a more bulk manner?
Please let me know if anything is unclear in comments, happy to edit or respond directly to improve the post!
Yes, you could use regular expressions to extract the data in one go; this is probably the best ratio of effort to performance.
If you need more performance, you could use mx.TextTools to build a finite state machine; I'm pretty confident this would be significantly faster, but the effort needed to write the rules and the learning curve might discourage you.
You could also split the data into chunks and parallelize the processing, which could help.
When you have working code and need to improve performance, use a profiler and measure the effect of one optimization at a time. (Even if you don't use the profiler, definitely do the latter.) Your present code looks reasonable, that is, I don't see anything "stupid" in it in terms of performance.
Having said that, it is likely to be worthwhile to use precompiled regular expressions for all string matching. By using re.MULTILINE, you can read in an entire file as a string and pull out parts of lines. For example:
import re

s = open('file.txt').read()
p = re.compile(r'^SEQ\s+(\w+)\s+(\d+)\s+(\d+)\s+(\d+)', re.MULTILINE)
p.findall(s)
produces:
[('homo_sapiens', '1', '11388669', '11532963'),
('pan_troglodytes', '1', '11517444', '11668750'),
('gorilla_gorilla', '1', '11607412', '11751006'),
('pongo_pygmaeus', '1', '218866021', '219020464'),
('macaca_mulatta', '1', '14425463', '14569832'),
('callithrix_jacchus', '7', '45949850', '46115230')]
You will then need to post-process this data to deal with the specific conditions in your code, but the overall result may be faster.
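One hedged sketch of that post-processing (it splits on the '//' terminator first, and extends the pattern above with a fifth group for the quality flag, which is my reading of the header format):

p = re.compile(r'^SEQ\s+(\w+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(-?\d+)', re.MULTILINE)
for segment in s.split('//'):
    flags = {name: qual for name, chrom, start, end, qual in p.findall(segment)}
    if flags.get('homo_sapiens') == '1' and flags.get('macaca_mulatta') == '1':
        pass  # locate the DATA block inside `segment` and compare the two columns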
Your code looks good, but there are particular things that could be improved, such as the use of map, etc.
For good guide on performance tips in Python see:
https://wiki.python.org/moin/PythonSpeed/PerformanceTips
I have used the above tips to get code working nearly as fast as C code. Basically, try to avoid for loops (use map), and try to find and use built-in functions. Make Python work for you as much as possible by using its builtins, which are largely written in C.
Once you get acceptable performance you can run in parallel using:
https://docs.python.org/dev/library/multiprocessing.html#module-multiprocessing
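For illustration, a minimal sketch of that (not code from the original post): the ~250 files are independent, so a process pool can run the existing per-chromosome function on several of them at once, assuming human_macaque_divergence is importable at module level so its arguments and results can be pickled:

from multiprocessing import Pool

if __name__ == '__main__':
    chromosomes = list(range(1, 23))   # illustrative list of inputs
    pool = Pool(processes=4)
    results = pool.map(human_macaque_divergence, chromosomes)
    pool.close()
    pool.join()
    div_by_chrom = dict(zip(chromosomes, results))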
Edit:
I also just realized you are opening a compressed gzip file. I suspect a significant amount of time is spent decompressing it. You can try to make this faster by multi-threading it with:
https://code.google.com/p/threadzip/
You can combine re with some fancy zipping in list comprehensions that can replace the for loops and try to squeeze some performance gains. Below I outline a strategy for segmenting the data file read in as an entire string:
import re
from itertools import izip #(if you are using py2x like me, otherwise just use zip for py3x)
s = open('test.txt').read()
Now find all header lines, and the corresponding index ranges in the large string
head_info = [(s[m.start():m.end()],m.start(), m.end()) for m in re.finditer('\nSEQ.*', s)]
head = [ h[0] for h in head_info]
head_inds = [ (h[1],h[2]) for h in head_info]
#head
#['\nSEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)',
# '\nSEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)',
# '\nSEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)',
# '\nSEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)',
# '\nSEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)',
# '\nSEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)',
# '\nSEQ homo_sapiens 1 11345717 11361846 1 (chr_length=249250621)',
#...
#head_inds
#[(107, 169),
# (169, 234),
# (234, 299),
# (299, 366),
# (366, 430),
# (430, 498),
# (1035, 1097),
# (1097, 1162)
# ...
Now, do the same for the data (lines of code with bases)
data_info = [(s[m.start():m.end()],m.start(), m.end()) for m in re.finditer('\n[AGCT-]+.*', s)]
data = [ d[0] for d in data_info]
data_inds = [ (d[1],d[2]) for d in data_info]
Now, whenever there is a new segment, there will be a discontinuity between head_inds[i][1] and head_inds[i+1][0]. Same for data_inds. We can use this knowledge to find the beginning and end of each segment as follows
head_seg_pos = [ idx+1 for idx,(i,j) in enumerate( izip( head_inds[:-1], head_inds[1:])) if j[0]-i[1]]
head_seg_pos = [0] + head_seg_pos + [len(head_inds)] # add beginning and end which we will use next
head_segmented = [ head[s1:s2] for s1,s2 in izip( head_seg_pos[:-1], head_seg_pos[1:]) ]
#[['\nSEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)',
# '\nSEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)',
# '\nSEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)',
# '\nSEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)',
# '\nSEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)',
# '\nSEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)'],
#['\nSEQ homo_sapiens 1 11345717 11361846 1 (chr_length=249250621)',
# '\nSEQ pan_troglodytes 1 11474525 11490638 1 (chr_length=229974691)',
# ...
and the same for the data
data_seg_pos = [ idx+1 for idx,(i,j) in enumerate( izip( data_inds[:-1], data_inds[1:])) if j[0]-i[1]]
data_seg_pos = [0] + data_seg_pos + [len(data_inds)] # add beginning and end for the next step
data_segmented = [ data[s1:s2] for s1,s2 in izip( data_seg_pos[:-1], data_seg_pos[1:]) ]
Now we can group the segmented data and segmented headers, and only keep groups with data on homo_sapiens and macaca_mulatta
groups = [ [h,d] for h,d in izip( head_segmented, data_segmented) if all( [sp in ''.join(h) for sp in ('homo_sapiens','macaca_mulatta')] ) ]
Now you have a groups array, where each group has
groups[0][0] # headers for segment 0
#['\nSEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)',
# '\nSEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)',
# '\nSEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)',
# '\nSEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)',
# '\nSEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)',
# '\nSEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)']
groups[0][1] # data from segment 0
#['\nGGGGGG',
# '\nCCCCTC',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# ...
The next step in the processing I will leave up to you, so I don't steal all the fun. But hopefully this gives you a good idea on using list comprehension to optimize code.
Update
Consider the simple test case to gauge efficiency of the comprehensions combined with re:
def test1():
    with open('test.txt','r') as f:
        head = []
        for line in f:
            if line.startswith('SEQ'):
                head.append( line)
        return head
def test2():
    s = open('test.txt').read()
    head = re.findall( '\nSEQ.*', s)
    return head
%timeit( test1() )
10000 loops, best of 3: 78 µs per loop
%timeit( test2() )
10000 loops, best of 3: 37.1 µs per loop
Even if I gather additional information using re
def test3():
    s = open('test.txt').read()
    head_info = [(s[m.start():m.end()],m.start(), m.end()) for m in re.finditer('\nSEQ.*', s)]
    head = [ h[0] for h in head_info]
    head_inds = [ (h[1],h[2]) for h in head_info]
%timeit( test3() )
10000 loops, best of 3: 50.6 µs per loop
I still get speed gains. I believe it may be faster in your case to use list comprehensions. However, the for loop might actually beat the comprehension in the end (I take back what I said before); consider:
def test1(): # similar to how you are reading in the data in your for loop above
    with open('test.txt','r') as f:
        head = []
        data = []
        species = []
        species_data = False
        for line in f:
            if line.startswith('SEQ'):
                head.append( line)
                species.append( line.split()[1] )
                continue
            if 'DATA' in line and {'homo_sapiens', 'macaca_mulatta'}.issubset(species):
                species_data = True
                continue
            if species_data and '//' not in line:
                data.append( line )
                continue
            if species_data and line.startswith( '//' ):
                species_data = False
                species = []
                continue
        return head, data
def test3():
    s = open('test.txt').read()
    head_info = [(s[m.start():m.end()],m.start(), m.end()) for m in re.finditer('\nSEQ.*', s)]
    head = [ h[0] for h in head_info]
    head_inds = [ (h[1],h[2]) for h in head_info]
    data_info = [(s[m.start():m.end()],m.start(), m.end()) for m in re.finditer('\n[AGCT-]+.*', s)]
    data = [ h[0] for h in data_info]
    data_inds = [ (h[1],h[2]) for h in data_info]
    return head,data
In this case, as the iterations become more complex, the traditional for loop wins
In [24]: %timeit(test1())
10000 loops, best of 3: 135 µs per loop
In [25]: %timeit(test3())
1000 loops, best of 3: 256 µs per loop
Though I can still use re.findall twice and beat the for loop:
def test4():
    s = open('test.txt').read()
    head = re.findall( '\nSEQ.*',s )
    data = re.findall( '\n[AGTC-]+.*',s)
    return head,data
In [37]: %timeit( test4() )
10000 loops, best of 3: 79.5 µs per loop
I guess as the processing of each iteration becomes increasingly complex, the for loop will win, though there might be a more clever way to continue on with re. I wish there was a standard way to determine when to use either.
File processing with Numpy
The data itself appears to be completely regular and can be processed easily with Numpy. The header is only a tiny part of the file and its processing speed is not very relevant. So the idea is to switch to Numpy for only the raw data and, other than that, keep the existing loops in place.
This approach works best if the number of lines in a data segment can be determined from the header. For the remainder of this answer I assume this is indeed the case. If this is not possible, the starting and ending points of data segments have to be determined with e.g. str.find or regex. This will still run at "compiled C speed", but the downside is that the file has to be looped over twice. In my opinion, if your files are only 50MB it's not a big problem to load a complete file into RAM.
E.g. put something like the following under if species_data and '//' not in line:
# Define `import numpy as np` at the top
# Determine number of rows from header data. This may need some
# tuning, if possible at all
nrows = max(ends[i]-starts[i] for i in range(len(species)))
# Sniff line length, because number of whitespace characters uncertain
fp = f.tell()
ncols = len(f.readline())
f.seek(fp)
# Load the data without loops. The file.read method can do the same,
# but with numpy.fromfile we have it in an array from the start.
data = np.fromfile(f, dtype='S1', count=nrows*ncols)
data = data.reshape(nrows, ncols)
# Process the data without Python loops. Here we leverage Numpy
# to really speed up the processing.
human = data[:,hom]
macaque = data[:,mac]
valid = np.in1d(human, bases) & np.in1d(macaque, bases)
mismatch = (human != macaque)
pos = starts[hom] + np.flatnonzero(valid & mismatch)
# Store
div_dict[(starts[hom], ends[hom], valid.sum())] = pos
# After using np.fromfile above, the file pointer _should_ be exactly
# in front of the segment termination flag
assert('//' in f.readline())
# Reset the header containers and flags
...
So the elif species_data and '//' in line: case has become redundant and the containers and flags can be reset in the same block as the above. Alternatively, you could also remove the assert('//' in f.readline()) and keep the elif species_data and '//' in line: case and reset containers and flags there.
Caveats
There is one caveat to relying on the file pointer to switch between processing the header and the data: (in CPython) iterating a file object uses a read-ahead buffer, causing the file pointer to be further down the file than you'd expect. If you then use numpy.fromfile with that file pointer, it skips over data at the start of the segment and, moreover, reads into the header of the next segment. This can be fixed by exclusively using the file.readline method. We can conveniently use it as an iterator like so:
for line in iter(f.readline, ''):
...
For determining the number of bytes to read with numpy.fromfile there is another caveat: Sometimes there is a single line termination character \n at the end of a line and other times two \r\n. The first is the convention on Linux/OSX and the latter on Windows. There is os.linesep to determine the default, but obviously for file parsing this isn't robust enough. So in the code above the length of a data line is determined by actually reading a line, checking the len, and putting back the file pointer to the start of the line.
When you encounter a data segment ('DATA' in line) and the desired species are not in it, you should be able to calculate an offset and f.seek(f.tell() + offset) to the header of the next segment. Much better than looping over data you're not even interested in!
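A hypothetical sketch of that skip, reusing the nrows/ncols sniffing from the code above (the exact arithmetic depends on how reliably the row count can be derived from the header):

if 'DATA' in line and not {'homo_sapiens', 'macaca_mulatta'}.issubset(species):
    f.seek(f.tell() + nrows * ncols)   # jump straight past the data block
    # the next readline() should then land on or just before the '//' terminator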

Iterate through a loop to change a conditional statement, python

New to programming in general. This is my code:
for b in range(LengthSpread):
    Strip = ReadSpread[b].rstrip('\n')
    SplitData = ReadSpread[b].split(",")
    PlotID = SplitData[1]
    PlotIDnum = float(PlotID)
    if PlotIDnum == 1:
        List = SplitData
        print List
        OpenBlank.writelines('%s\n\n\n\n\n' % List)
Ultimately I want to find data based on changing each PlotIDnum in the overall dataset. How would I change the number in the conditional if statement without physically changing the number? Possibly using a for loop, or a while loop? I can't wrap my mind around it.
This is an example of the inputdata
09Tree #PlotID PlotID
1 1 Tree
2 1 Tree
3 2 Tree
4 2 Tree
6 4 Tree
7 5 Tree
8 5 Tree
9 5 Tree
I want my output to be organized by plotID#, and place each output in either a new spreadsheet or have each unique dataset in a new tab
Thanks for any help
I'm not sure how exactly you would like to organize your files, but maybe you could use the plot ID as part of the file name (or name of the tab or whatever). This way you don't even need the extra loop, for example:
for b in range(length_spread):
    data = read_spread[b].rstrip('\n')
    splitted = data.split(',')
    plot_id = splitted[1] # Can keep it as a string
    filename = 'plot_id_' + plot_id + '.file_extension'
    spreadsheet = some_open_method(filename, option='append')
    spreadsheet.writelines('%s\n\n\n\n\n' % splitted)
    spreadsheet.close_method()
Perhaps you could also make use of the with statement:
with some_open_method(filename) as spreadsheet:
    spreadsheet.writelines('%s\n\n\n\n\n' % splitted)
This ensures (if your file-object supports this) that the file is properly closed even if your program encounters an exception during writing to the file.
If you want to use some kind of extra loop I think this is the simplest case, assuming you know all the plot ID's beforehand:
all_ids = [1, 2, 4, 5]
# Note: using plot_id as integer now
for plot_id in all_ids:
    filename = 'plot_id_%i.file_extension' % plot_id
    spreadsheet = some_open_method(filename, option='write')
    for b in range(length_spread):
        data = read_spread[b].rstrip('\n')
        splitted = data.split(',')
        if plot_id == int(splitted[1]):
            spreadsheet.writelines('%s\n\n\n\n\n' % splitted)
    spreadsheet.close_method()
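If the plot IDs are not known in advance, a hedged variant (still using the placeholder some_open_method from above) is to group the rows in a dict first and then write each group once:

from collections import defaultdict

groups = defaultdict(list)
for row in read_spread:
    splitted = row.rstrip('\n').split(',')
    groups[splitted[1]].append(splitted)   # key each row by its plot ID

for plot_id, rows in groups.items():
    spreadsheet = some_open_method('plot_id_%s.file_extension' % plot_id, option='write')
    for splitted in rows:
        spreadsheet.writelines('%s\n\n\n\n\n' % splitted)
    spreadsheet.close_method()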
