Improve performance of dataframe-like structure - python

I'm facing a data-structure challenge in a part of my code where I need to count the frequency of strings in positive and negative examples.
It's a large bottleneck and I don't seem to be able to find a better solution.
I have to go through every long string in the dataset and extract substrings, whose frequencies I need to count. In a perfect world, a pandas dataframe of the following shape would be ideal. At the end, the expected structure is something like:
string | frequency positive | frequency negative
________________________________________________
str1   |          5         |          7
str2   |          2         |          4
...
However, for obvious performance reasons, this is not acceptable.
My solution is to use a dictionary to track the rows, and an Nx2 numpy matrix to track the frequencies. This is also done because, after this step, I need the frequencies in an Nx2 numpy matrix anyway.
Currently, my solution is something like this:
str_freq = np.zeros((N, 2), dtype=np.uint32)
str_dict = {}
str_dict_counter = 0
for i, string in enumerate(dataset):
    substrings = extract(string)  # substrings is a List[str]
    for substring in substrings:
        row = str_dict.get(substring, None)
        if row is None:
            str_dict[substring] = str_dict_counter
            row = str_dict_counter
            str_dict_counter += 1
        str_freq[row, target[i]] += 1  # target[i] is equal to 1 or 0
However, it is really the bottleneck of my code, and I'd like to speed it up.
Some things about this code are incompressible, for instance the extract(string) call, so that loop has to remain. However, there is no problem with using parallel processing if that helps.
What I'm especially wondering is whether there is a way to improve the inner loop. Python is known to be bad with loops, and this one seems a bit pointless; however, since we can't (to my knowledge) do multiple gets and sets on dictionaries the way we can with numpy arrays, I don't know how I could improve it.
What do you suggest doing? Is the only solution to rewrite this in some lower-level language?
I also thought about using SQLite, but I don't know if it's worth it.
For the record, this has to process about 10MB of data; it currently takes about 45 seconds, but needs to be done repeatedly with new data each time.
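One small change worth trying for the inner loop (a sketch of the idea only, not benchmarked) is to collapse the separate get and set into a single dict.setdefault call, so each new substring costs one hash lookup instead of two:

str_freq = np.zeros((N, 2), dtype=np.uint32)
str_dict = {}

for i, string in enumerate(dataset):
    for substring in extract(string):
        # setdefault returns the existing row if the key is known, otherwise
        # it inserts the next free row index (len before insertion) and returns it
        row = str_dict.setdefault(substring, len(str_dict))
        str_freq[row, target[i]] += 1

Whether this helps noticeably depends on how much of the 45 seconds is spent in the dictionary versus in extract itself.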
EDIT: Added example to test yourself
import random
import string
import re
import numpy as np
import pandas as pd

def get_random_alphaNumeric_string(stringLength=8):
    return bytes(bytearray(np.random.randint(0, 256, stringLength, dtype=np.uint8)))

def generate_dataset(n=10000):
    d = []
    for i in range(n):
        rnd_text = get_random_alphaNumeric_string(stringLength=1000)
        d.append(rnd_text)
    return d

def test_dict(dataset):
    pattern = re.compile(b"(q.{3})")
    target = np.random.randint(0, 2, len(dataset))
    str_freq = np.zeros((len(dataset) * len(dataset[0]), 2), dtype=np.uint32)
    str_dict = {}
    str_dict_counter = 0
    for i, string in enumerate(dataset):
        substrings = pattern.findall(string)  # substrings is a List[str]
        for substring in substrings:
            row = str_dict.get(substring, None)
            if row is None:
                str_dict[substring] = str_dict_counter
                row = str_dict_counter
                str_dict_counter += 1
            str_freq[row, target[i]] += 1  # target[i] is equal to 1 or 0
    return str_dict, str_freq[:str_dict_counter, :]

def test_df(dataset):
    pattern = re.compile(b"(q.{3})")
    target = np.random.randint(0, 2, len(dataset))
    df = pd.DataFrame(columns=["str", "pos", "neg"])
    df.astype(dtype={"str": bytes, "pos": int, "neg": int}, copy=False)
    df = df.set_index("str")
    for i, string in enumerate(dataset):
        substrings = pattern.findall(string)  # substrings is a List[str]
        for substring in substrings:
            check = substring in df.index
            if not check:
                row = [0, 0]
                row[target[i]] = 1
                df.loc[substring] = row
            else:
                df.loc[substring][target[i]] += 1
    return df

dataset = generate_dataset(1000000)

d, f = test_dict(dataset)  # takes ~10 seconds on my laptop
# to get the value of some key, say b'q123'
f[d[b'q123'], :]

d = test_df(dataset)  # takes several minutes (hasn't finished yet)
# the same but with a dataframe
d.loc[b'q123']


How do I know if a position is in a region using python

I have 2 files.
File A has 3 columns: chromosome,start position,end position.
CHR,START,END
chr1,1203245,1203374
chr1,1202020,1202213
chr1,1201293,1201465
chr1,1200844,1201128
chr1,1200527,1200585
File B has 2 columns: chromosome,position.
CHR,POS
chr1,1579264
chr1,1641372
chr1,3020521
chr2,2097836
chr3,2374462
Both files are large.
How do I determine, using Python, whether each position in file B falls within any region of file A? I can write code for a single position and a single region, but I don't have a clue how to handle a whole list of regions.
"Position is in a region" means the CHR values match and POS >= START and POS <= END; for example, chr1,1203250 of file B is in region chr1,1203245,1203374 of file A.
I took oscarbranson's advice and wrote this code:
for i, r in B.iterrows():
    B.loc[i, 'in_A'] = any((r.CHR == A.CHR) & (r.POS >= A.START) & (r.POS <= A.END))
But both files are large and the code is still running. Is there any way to make this faster?
If your files are big you cannot just load both in a dataframe, as it would fill your entire memory.
Do not compare each position of B to every interval in A, because this is terribly inefficient (read: very, very slow).
Instead, do this:
Sort both files by chrom, start, end.
Iterate over the lines of both files concurrently. Something like this (this is just the idea; it does not take the chromosome into account, may still have rough edges, and you need to define how you save the position):
a = open("A")
b = open("B")
# init
(chromA,start,end) = a.readline().split(",")
(chromB,pos) = b.readline().split(",")
while 1:
try:
if pos < start:
# Read more lines from B
(chromB,pos) = b.readline().split(",")
elif pos > end:
# Read more lines from A
(chromA,start,end) = a.readline().split(",")
else:
# Pos is in the [start, end] interval
# Save the result and go to the next pos
savePos()
(chromB,pos) = b.readline().split(",")
except StopIteration:
break
My real advice though would be to not do this manually like every bioinformatician in the past, but to use "bedtools" or existing Python libraries like BioPython. Reinventing the wheel is a terrible loss of time and resources for research.
I've interpreted your question as 'are the POS values in each row of B between any of the START and END values in all rows of A?' If that's correct, something like this will work, using the pandas data analysis library:
import pandas as pd

A = pd.read_csv('path/to/A.csv')
B = pd.read_csv('path/to/B.csv')

for i, r in B.iterrows():
    B.loc[i, 'in_A'] = any((r.POS > A.START) & (r.POS < A.END))

print(B)
>>    CHR      POS   in_A
0    chr1  1579264  False
1    chr1  1641372  False
2    chr1  3020521  False
3    chr2  2097836  False
4    chr3  2374462  False
This iterates through every row in B, and generates a boolean array that is TRUE where (POS > START) & (POS < END) for all rows of A. It then makes a new column in the B dataframe, which is True if POS exists within any of the START/END limits in A.
Make sense?
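If memory allows, a vectorized variant of the same check avoids iterrows entirely; here is a minimal sketch (my own addition, assuming both files fit in memory, using NumPy broadcasting and comparing CHR as well):

import numpy as np
import pandas as pd

A = pd.read_csv('path/to/A.csv')
B = pd.read_csv('path/to/B.csv')

in_A = np.zeros(len(B), dtype=bool)
# handle one chromosome at a time so that CHR is compared too
for chrom, b_idx in B.groupby('CHR').groups.items():
    regions = A[A.CHR == chrom]
    if regions.empty:
        continue
    pos = B.loc[b_idx, 'POS'].to_numpy()[:, None]   # shape (n_pos, 1)
    starts = regions.START.to_numpy()[None, :]      # shape (1, n_regions)
    ends = regions.END.to_numpy()[None, :]
    # every position of this chromosome is checked against every region at once
    in_A[B.index.get_indexer(b_idx)] = ((pos >= starts) & (pos <= ends)).any(axis=1)

B['in_A'] = in_A

The comparisons still scale as n_positions x n_regions per chromosome, so for truly huge files the sorted merge described in the other answer (or an interval tree) remains the better approach.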

Find if rows in a large file contain a substring from a separate list?

I have a large (30GB) file consisting of random terms and sentences. I have two separate lists of words and phrases to check it against, and I want to mark (or alternatively filter) each row in which a term from either list appears.
If a term from list X appears in a row of the large file, mark it X; if a term from list Y appears, mark it Y. When done, output the rows marked X to one file and the rows marked Y to a separate file. My problem is that both of my lists are 1500 terms long, so going through them line by line takes a while.
After fidgeting around for a while, I've arrived at my current method, which filters each chunk on whether it contains a term. My issue is that it is very slow. Is there a way to speed up my script? I'm using pandas to process the file in chunks of 1 million rows; it takes around 3 minutes to process a chunk with my current method:
white_text_file = open('lists/health_whitelist_final.txt', "r")
white_list = white_text_file.read().split(',\n')

black_text_file = open('lists/health_blacklist_final.txt', "r")
black_list = black_text_file.read().split(',\n')

for chunk in pd.read_csv('final_cleaned_corpus.csv', chunksize=chunksize, names=['Keyword']):
    print("Chunk")
    chunk_y = chunk[chunk['Keyword'].str.contains('|'.join(white_list), na=False)]
    chunk_y.to_csv(VERTICAL+'_y_list.csv', mode='a', header=None)
    chunk_x = chunk[chunk['Keyword'].str.contains('|'.join(black_list), na=False)]
    chunk_x.to_csv(VERTICAL+'_x_list.csv', mode='a', header=None)
My first attempt was less pythonic, but the aim was to break out of the loop the first time an element appears; it was slower than my current method:
def x_or_y(keyword):
    #print(keyword)
    iden = ''
    for item in y_list:
        if item in keyword:
            iden = 'Y'
            break
    for item in x_list:
        if item in keyword:
            iden = 'X'
            break
    return iden
Is there a faster way I'm missing here?
If I understand correctly, you need to do the following:
fileContent = ['yes', 'foo', 'junk', 'yes', 'foo', 'junk']
x_list = ['yes', 'no', 'maybe-so']
y_list = ['foo', 'bar', 'fizzbuzz']

def x_or_y(keyword):
    if keyword in x_list:
        return 'X'
    if keyword in y_list:
        return 'Y'
    return ''

results = map(x_or_y, fileContent)
print(list(results))
Here is an example: https://repl.it/F566
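Note that the snippet above tests exact equality against the lists, while the question needs substring matches. For the substring case, a small change to the original chunked approach that sometimes helps (a sketch only; file names are placeholders and the terms are assumed to be literal strings) is to build and compile each alternation pattern once with re.escape, outside the chunk loop:

import re
import pandas as pd

with open('lists/health_whitelist_final.txt') as f:
    white_list = f.read().split(',\n')
with open('lists/health_blacklist_final.txt') as f:
    black_list = f.read().split(',\n')

# compile the patterns once instead of re-joining 1500 terms for every chunk
white_pat = re.compile('|'.join(re.escape(t) for t in white_list if t))
black_pat = re.compile('|'.join(re.escape(t) for t in black_list if t))

for chunk in pd.read_csv('final_cleaned_corpus.csv', chunksize=1000000, names=['Keyword']):
    is_y = chunk['Keyword'].str.contains(white_pat, na=False)
    is_x = chunk['Keyword'].str.contains(black_pat, na=False)
    chunk[is_y].to_csv('y_list.csv', mode='a', header=None)
    chunk[is_x].to_csv('x_list.csv', mode='a', header=None)

If most of the time is spent in the regex scan itself rather than in rebuilding the pattern, the usual next step is a multi-pattern matcher such as Aho-Corasick (for example the pyahocorasick package).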

Efficient lookup by common words

I have a list of names (strings) divided into words. There are 8 million names; each name consists of up to 20 words (tokens). The number of unique tokens is 2.2 million. I need an efficient way to find all names containing at least one word from the query (which may also contain up to 20 words, but usually only a few).
My current approach uses Python Pandas and looks like this (later referred to as original):
>>> df = pd.DataFrame([['foo', 'bar', 'joe'],
                       ['foo'],
                       ['bar', 'joe'],
                       ['zoo']],
                      index=['id1', 'id2', 'id3', 'id4'])
>>> df.index.rename('id', inplace=True)  # btw, is there a way to include this into prev line?
>>> print df
       0     1     2
id
id1  foo   bar   joe
id2  foo  None  None
id3  bar   joe  None
id4  zoo  None  None

def filter_by_tokens(df, tokens):
    # search within each column and then concatenate and dedup results
    results = [df.loc[lambda df: df[i].isin(tokens)] for i in range(df.shape[1])]
    return pd.concat(results).reset_index().drop_duplicates().set_index(df.index.name)

>>> print filter_by_tokens(df, ['foo', 'zoo'])
       0     1     2
id
id1  foo   bar   joe
id2  foo  None  None
id4  zoo  None  None
Currently such lookup (on the full dataset) takes 5.75s on my (rather powerful) machine. I'd like to speed it up at least, say, 10 times.
I was able to get to 5.29s by squeezing all columns into one and performing the lookup on that (later referred to as original, squeezed):
>>> df = pd.Series([{'foo', 'bar', 'joe'},
                    {'foo'},
                    {'bar', 'joe'},
                    {'zoo'}],
                   index=['id1', 'id2', 'id3', 'id4'])
>>> df.index.rename('id', inplace=True)
>>> print df
id
id1    {foo, bar, joe}
id2              {foo}
id3         {bar, joe}
id4              {zoo}
dtype: object

def filter_by_tokens(df, tokens):
    return df[df.map(lambda x: bool(x & set(tokens)))]

>>> print filter_by_tokens(df, ['foo', 'zoo'])
id
id1    {foo, bar, joe}
id2              {foo}
id4              {zoo}
dtype: object
But that's still not fast enough.
Another solution which seems to be easy to implement is to use Python multiprocessing (threading shouldn't help here because of GIL and there is no I/O, right?). But the problem with it is that the big dataframe needs to be copied to each process, which takes up all the memory. Another problem is that I need to call filter_by_tokens many times in a loop, so it would copy the dataframe on every call, which is inefficient.
Note that words may occur many times in names (e.g. the most popular word occurs 600k times in names), so a reverse index would be huge.
What is a good way to write this efficiently? Python solution preferred, but I'm also open to other languages and technologies (e.g. databases).
UPD:
I've measured the execution time of my two solutions and the 5 solutions suggested by @piRSquared in his answer. Here are the results (tl;dr: the best gives a 2x improvement):
+--------------------+----------------+
| method | best of 3, sec |
+--------------------+----------------+
| original | 5.75 |
| original, squeezed | 5.29 |
| zip | 2.54 |
| merge | 8.87 |
| mul+any | MemoryError |
| isin | IndexingError |
| query | 3.7 |
+--------------------+----------------+
mul+any gives MemoryError on d1 = pd.get_dummies(df.stack()).groupby(level=0).sum() (on a 128Gb RAM machine).
isin gives IndexingError: Unalignable boolean Series key provided on s[d1.isin({'zoo', 'foo'}).unstack().any(1)], apparently because shape of df.stack().isin(set(tokens)).unstack() is slightly less than the shape of the original dataframe (8.39M vs 8.41M rows), not sure why and how to fix that.
Note that the machine I'm using has 12 cores (though I mentioned some problems with parallelization above). All of the solutions utilize a single core.
Conclusion (as of now): there is a 2.1x improvement from zip (2.54s) vs the original squeezed solution (5.29s). That's good, though I aimed for at least a 10x improvement if possible. So I'm leaving the (still great) @piRSquared answer unaccepted for now, to welcome more suggestions.
idea 0
zip
def pir(s, token):
    return s[[bool(p & token) for p in s]]

pir(s, {'foo', 'zoo'})
idea 1
merge
token = pd.DataFrame(dict(v=['foo', 'zoo']))
d1 = df.stack().reset_index('id', name='v')
s.ix[d1.merge(token).id.unique()]
idea 2
mul + any
d1 = pd.get_dummies(df.stack()).groupby(level=0).sum()
token = pd.Series(1, ['foo', 'zoo'])
s[d1.mul(token).any(1)]
idea 3
isin
d1 = df.stack()
s[d1.isin({'zoo', 'foo'}).unstack().any(1)]
idea 4
query
token = ('foo', 'zoo')
d1 = df.stack().to_frame('s')
s.ix[d1.query('s in @token').index.get_level_values(0).unique()]
I have done similar things with the following tools:
HBase - the key can have multiple columns (very fast)
ElasticSearch - nice and easy to scale; you just need to import your data as JSON
Apache Lucene - will be very good for 8 million records
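For the ElasticSearch route, the overall shape would be roughly as below. This is only a sketch: it assumes the elasticsearch-py client and a token_lists variable holding each name as a list of words, and the exact call signatures differ between client versions:

from elasticsearch import Elasticsearch
from elasticsearch.helpers import bulk

es = Elasticsearch()

# index each name once, with its tokens stored as an array field
actions = ({'_index': 'names', '_id': i, '_source': {'tokens': tokens}}
           for i, tokens in enumerate(token_lists))
bulk(es, actions)

# a terms query matches any document containing at least one of the query words
resp = es.search(index='names',
                 body={'query': {'terms': {'tokens': ['foo', 'zoo']}}},
                 size=10000)
ids = [hit['_id'] for hit in resp['hits']['hits']]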
You can do it with a reverse index; the code below, run in PyPy, builds the index in 57 seconds, does a query of 20 words in 0.00018 seconds and uses about 3.2Gb of memory. Python 2.7 builds the index in 158 seconds and does the query in 0.0013 seconds using about 3.41Gb of memory.
The fastest possible way to do this is with bitmapped reversed indexes, compressed to save space.
"""
8m records with between 1 and 20 words each, selected at random from 100k words
Build dictionary of sets, keyed by word number, set contains nos of all records
with that word
query merges the sets for all query words
"""
import random
import time records = 8000000
words = 100000
wordlists = {}
print "build wordlists"
starttime = time.time()
wordlimit = words - 1
total_words = 0
for recno in range(records):
for x in range(random.randint(1,20)):
wordno = random.randint(0,wordlimit)
try:
wordlists[wordno].add(recno)
except:
wordlists[wordno] = set([recno])
total_words += 1
print "build time", time.time() - starttime, "total_words", total_words
querylist = set()
query = set()
for x in range(20):
while 1:
wordno = (random.randint(0,words))
if wordno in wordlists: # only query words that were used
if not wordno in query:
query.add(wordno)
break
print "query", query
starttime = time.time()
for wordno in query:
querylist.union(wordlists[wordno])
print "query time", time.time() - starttime
print "count = ", len(querylist)
for recno in querylist:
print "record", recno, "matches"
Perhaps my first answer was a bit abstract; in the absence of real data it generated random data of approximately the required volume to get a feel for the query time. This code is practical.
data = [['foo', 'bar', 'joe'],
        ['foo'],
        ['bar', 'joe'],
        ['zoo']]

wordlists = {}

print "build wordlists"
for x, d in enumerate(data):
    for word in d:
        try:
            wordlists[word].add(x)
        except:
            wordlists[word] = set([x])

print "query"
query = ["foo", "zoo"]
results = set()
for q in query:
    wordlist = wordlists.get(q)
    if wordlist:
        results = results.union(wordlist)
l = list(results)
l.sort()
for x in l:
    print data[x]
The cost in time and memory is in building the wordlists (inverted indices); querying is almost free. You have a 12-core machine, so presumably it has plenty of memory. For repeated use, build the wordlists once, pickle each wordlist and write it to SQLite or any key/value database, with the word as key and the pickled set as a binary blob. Then all you need is:
initialise_database()

query = ["foo", "zoo"]
results = set()
for q in query:
    wordlist = get_wordlist_from_database(q)  # get binary blob and unpickle
    if wordlist:
        results = results.union(wordlist)
l = list(results)
l.sort()
for x in l:
    print data[x]
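The two database helpers above are only named, not shown. A minimal sketch of what they could look like (my own filling-in, assuming the wordlists dict built earlier and a plain sqlite3 table with the word as primary key and the pickled set as a BLOB):

import pickle
import sqlite3

DB_PATH = "wordlists.db"  # placeholder file name
_conn = None

def initialise_database():
    # create the table and (re)fill it from the in-memory wordlists dict
    global _conn
    _conn = sqlite3.connect(DB_PATH)
    _conn.execute("CREATE TABLE IF NOT EXISTS wordlists (word TEXT PRIMARY KEY, recs BLOB)")
    _conn.executemany(
        "INSERT OR REPLACE INTO wordlists VALUES (?, ?)",
        ((word, sqlite3.Binary(pickle.dumps(recnos, pickle.HIGHEST_PROTOCOL)))
         for word, recnos in wordlists.items()))
    _conn.commit()

def get_wordlist_from_database(word):
    # fetch the binary blob for one word and unpickle it back into a set
    row = _conn.execute("SELECT recs FROM wordlists WHERE word = ?", (word,)).fetchone()
    return pickle.loads(bytes(row[0])) if row else None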
Alternatively, use arrays, which are more memory-efficient and probably faster for building the index. PyPy is more than 10x faster than 2.7 here.
import array

data = [['foo', 'bar', 'joe'],
        ['foo'],
        ['bar', 'joe'],
        ['zoo']]

wordlists = {}

print "build wordlists"
for x, d in enumerate(data):
    for word in d:
        try:
            wordlists[word].append(x)
        except:
            wordlists[word] = array.array("i", [x])

print "query"
query = ["foo", "zoo"]
results = set()
for q in query:
    wordlist = wordlists.get(q)
    if wordlist:
        for i in wordlist:
            results.add(i)
l = list(results)
l.sort()
for x in l:
    print data[x]
If you know that the number of unique tokens that you'll see is relatively small,
you can pretty easily build an efficient bitmask to query for matches.
The naive approach (in original post) will allow for up to 64 distinct tokens.
The improved code below uses the bitmask like a bloom filter (modular arithmetic in setting the bits wraps around 64). If there are more than 64 unique tokens, there will be some false positives, which the code below will automatically verify (using the original code).
Now the worst-case performance will degrade if the number of unique tokens is (much) larger than 64, or if you get particularly unlucky. Hashing could mitigate this.
As far as performance goes, using the benchmark data set below, I get:
Original Code: 4.67 seconds
Bitmask Code: 0.30 seconds
However, when the number of unique tokens is increased, the bitmask code remains efficient while the original code slows down considerably. With about 70 unique tokens, I get something like:
Original Code: ~15 seconds
Bitmask Code: 0.80 seconds
Note: for this latter case, building the bitmask array from the supplied list takes about as much time as building the dataframe. There's probably no real reason to build the dataframe; I left it in mainly for ease of comparison to the original code.
import time

import numpy as np
import pandas as pd

class WordLookerUpper(object):
    def __init__(self, token_lists):
        tic = time.time()
        self.df = pd.DataFrame(token_lists,
                               index=pd.Index(
                                   data=['id%d' % i for i in range(len(token_lists))],
                                   name='index'))
        print('took %d seconds to build dataframe' % (time.time() - tic))
        tic = time.time()
        dii = {}
        iid = 0
        self.bits = np.zeros(len(token_lists), np.int64)
        for i in range(len(token_lists)):
            for t in token_lists[i]:
                if t not in dii:
                    dii[t] = iid
                    iid += 1
                # set the bit; note that b = dii[t] % 64
                # this 'wrap around' behavior lets us use this
                # bitmask as a probabilistic filter
                b = dii[t] % 64
                self.bits[i] |= (1 << b)
        self.string_to_iid = dii
        print('took %d seconds to build bitmask' % (time.time() - tic))

    def filter_by_tokens(self, tokens, df=None):
        if df is None:
            df = self.df
        tic = time.time()
        # search within each column and then concatenate and dedup results
        results = [df.loc[lambda df: df[i].isin(tokens)] for i in range(df.shape[1])]
        results = pd.concat(results).reset_index().drop_duplicates().set_index('index')
        print('took %0.2f seconds to find %d matches using original code' % (
            time.time() - tic, len(results)))
        return results

    def filter_by_tokens_with_bitmask(self, search_tokens):
        tic = time.time()
        bitmask = np.zeros(len(self.bits), np.int64)
        verify = np.zeros(len(self.bits), np.int64)
        verification_needed = False
        for t in search_tokens:
            b = self.string_to_iid[t] % 64
            bitmask |= (self.bits & (1 << b))
            if self.string_to_iid[t] >= 64:
                verification_needed = True
                verify |= (self.bits & (1 << b))
        if verification_needed:
            results = self.df[(bitmask > 0) & ~verify.astype(bool)]
            results = pd.concat([results,
                                 self.filter_by_tokens(search_tokens,
                                                       self.df[(bitmask > 0) & verify.astype(bool)])])
        else:
            results = self.df[bitmask > 0]
        print('took %0.2f seconds to find %d matches using bitmask code' % (
            time.time() - tic, len(results)))
        return results
Make some test data
unique_token_lists = [
    ['foo', 'bar', 'joe'],
    ['foo'],
    ['bar', 'joe'],
    ['zoo'],
    ['ziz', 'zaz', 'zuz'],
    ['joe'],
    ['joey', 'joe'],
    ['joey', 'joe', 'joe', 'shabadoo']
]

token_lists = []
for n in range(1000000):
    token_lists.extend(unique_token_lists)
Run the original code and the bitmask code
>>> wlook = WordLookerUpper(token_lists)
took 5 seconds to build dataframe
took 10 seconds to build bitmask
>>> wlook.filter_by_tokens(['foo','zoo']).tail(n=1)
took 4.67 seconds to find 3000000 matches using original code
id7999995 zoo None None None
>>> wlook.filter_by_tokens_with_bitmask(['foo','zoo']).tail(n=1)
took 0.30 seconds to find 3000000 matches using bitmask code
id7999995 zoo None None None

Beyond for-looping: high performance parsing of a large, well formatted data file

I am looking to optimize the performance of a big data parsing problem I have using python. In case anyone is interested: the data shown below is segments of whole genome DNA sequence alignments for six primate species.
Currently, the best way I know how to proceed with this type of problem is to open each of my ~250 (size 20-50MB) files, loop through line by line and extract the data I want. The formatting (shown in examples) is fairly regular although there are important changes at each 10-100 thousand line segment. Looping works fine but it is slow.
I have been using numpy recently for processing massive (>10 GB) numerical data sets and I am really impressed at how quickly I am able to perform different computations on arrays. I wonder if there are some high-powered solutions for processing formatted text that circumvent tedious for-looping?
My files contain multiple segments with the pattern
<MULTI-LINE HEADER> # number of header lines mirrors number of data columns
<DATA BEGIN FLAG> # the word 'DATA'
<DATA COLUMNS> # variable number of columns
<DATA END FLAG> # the pattern '//'
<EMPTY LINE>
Example:
# key to the header fields:
# header_flag chromosome segment_start segment_end quality_flag chromosome_data
SEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)
SEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)
SEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)
SEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)
SEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)
SEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)
DATA
GGGGGG
CCCCTC
...... # continue for 10-100 thousand lines
//
SEQ homo_sapiens 1 11345717 11361846 1 (chr_length=249250621)
SEQ pan_troglodytes 1 11474525 11490638 1 (chr_length=229974691)
SEQ gorilla_gorilla 1 11562256 11579393 1 (chr_length=229966203)
SEQ pongo_pygmaeus 1 219047970 219064053 -1 (chr_length=229942017)
DATA
CCCC
GGGG
.... # continue for 10-100 thousand lines
//
<ETC>
I will use segments where the species homo_sapiens and macaca_mulatta are both present in the header, and field 6, which I called the quality flag in the comments above, equals '1' for each species. Since macaca_mulatta does not appear in the second example, I would ignore this segment completely.
I care about segment_start and segment_end coordinates for homo_sapiens only, so in segments where homo_sapiens is present, I will record these fields and use them as keys to a dict(). segment_start also tells me the first positional coordinate for homo_sapiens, which increases strictly by 1 for each line of data in the current segment.
I want to compare the letters (DNA bases) for homo_sapiens and macaca_mulatta. The header line where homo_sapiens and macaca_mulatta appear (i.e. 1 and 5 in the first example) correspond to the column of data representing their respective sequences.
Importantly, these columns are not always the same, so I need to check the header to get the correct indices for each segment, and to check that both species are even in the current segment.
Looking at the two lines of data in example 1, the relevant information for me is
# homo_sapiens_coordinate homo_sapiens_base macaca_mulatta_base
11388669 G G
11388670 C T
For each segment containing info for homo_sapiens and macaca_mulatta, I will record start and end for homo_sapiens from the header and each position where the two DO NOT match into a list. Finally, some positions have "gaps" or lower quality data, i.e.
aaa--A
I will only record from positions where homo_sapiens and macaca_mulatta both have valid bases (must be in the set ACGT) so the last variable I consider is a counter of valid bases per segment.
My final data structure for a given file is a dictionary which looks like this:
{(segment_start=i, segment_end=j, valid_bases=N): list(mismatch positions),
(segment_start=k, segment_end=l, valid_bases=M): list(mismatch positions), ...}
Here is the function I have written to carry this out using a for-loop:
def human_macaque_divergence(chromosome):
    """
    A function for finding the positions of human-macaque divergent sites within segments of species alignment tracts
    :param chromosome: chromosome (integer)
    :return div_dict: a dictionary with tuple(segment_start, segment_end, valid_bases_in_segment) for keys and list(divergent_sites) for values
    """
    ch = str(chromosome)
    div_dict = {}
    with gz.open('{al}Compara.6_primates_EPO.chr{c}_1.emf.gz'.format(al=pd.align, c=ch), 'rb') as f:
        # key to the header fields:
        # header_flag chromosome segment_start segment_end quality_flag chromosome_info
        # SEQ homo_sapiens 1 14163 24841 1 (chr_length=249250621)

        # flags, containers, counters and indices:
        species = []
        starts = []
        ends = []
        mismatch = []
        valid = 0
        pos = -1
        hom = None
        mac = None
        species_data = False  # a flag signalling that the lines we are viewing are alignment columns

        for line in f:
            if 'SEQ' in line:  # 'SEQ' signifies a segment info field
                assert species_data is False
                line = line.split()
                if line[2] == ch and line[5] == '1':  # make sure the alignment is to the desired chromosome in humans and quality_flag is '1'
                    species += [line[1]]   # collect each species in the header
                    starts += [int(line[3])]  # collect starts and ends
                    ends += [int(line[4])]
            if 'DATA' in line and {'homo_sapiens', 'macaca_mulatta'}.issubset(species):
                species_data = True
                # get the indices to scan in data columns:
                hom = species.index('homo_sapiens')
                mac = species.index('macaca_mulatta')
                pos = starts[hom]  # first homo_sapiens positional coordinate
                continue
            if species_data and '//' not in line:
                assert pos > 0
                # record the relevant bases:
                human = line[hom]
                macaque = line[mac]
                if {human, macaque}.issubset(bases):
                    valid += 1
                if human != macaque and {human, macaque}.issubset(bases):
                    mismatch += [pos]
                pos += 1
            elif species_data and '//' in line:  # '//' signifies a segment boundary
                # store segment results if a boundary has been reached and data has been collected for the last segment:
                div_dict[(starts[hom], ends[hom], valid)] = mismatch
                # reset flags, containers, counters and indices
                species = []
                starts = []
                ends = []
                mismatch = []
                valid = 0
                pos = -1
                hom = None
                mac = None
                species_data = False
            elif not species_data and '//' in line:
                # reset flags, containers, counters and indices
                species = []
                starts = []
                ends = []
                pos = -1
                hom = None
                mac = None
    return div_dict
This code works fine (perhaps it could use some tweaking), but my real question is whether or not there might be a faster way to pull this data without running the for-loop and examining each line? For example, loading the whole file using f.read() takes less than a second although it creates a pretty complicated string. (In principle, I assume that I could use regular expressions to parse at least some of the data, such as the header info, but I'm not sure if this would necessarily increase performance without some bulk method to process each data column in each segment).
Does anyone have any suggestions as to how I circumvent looping through billions of lines and parse this kind of text file in a more bulk manner?
Please let me know if anything is unclear in comments, happy to edit or respond directly to improve the post!
Yes, you could use some regular expressions to extract the data in one go; this is probably the best effort/performance ratio.
If you need more performance, you could use mx.TextTools to build a finite state machine; I'm pretty confident this will be significantly faster, but the effort needed to write the rules and the learning curve might discourage you.
You could also split the data into chunks and parallelize the processing; this could help (see the sketch below).
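Since there are ~250 independent files, the simplest split is one file per worker process. A minimal sketch, assuming the human_macaque_divergence function from the question and Python's standard multiprocessing module:

from multiprocessing import Pool

def run_all(chromosomes, workers=4):
    # each worker parses one compressed file; results come back as a list of
    # per-chromosome dictionaries, in the same order as `chromosomes`
    pool = Pool(processes=workers)
    try:
        return pool.map(human_macaque_divergence, chromosomes)
    finally:
        pool.close()
        pool.join()

if __name__ == '__main__':
    results = run_all(range(1, 23))  # autosomes 1-22, as an example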
When you have working code and need to improve performance, use a profiler and measure the effect of one optimization at a time. (Even if you don't use the profiler, definitely do the latter.) Your present code looks reasonable, that is, I don't see anything "stupid" in it in terms of performance.
Having said that, it is likely to be worthwhile to use precompiled regular expressions for all string matching. By using re.MULTILINE, you can read in an entire file as a string and pull out parts of lines. For example:
s = open('file.txt').read()
p = re.compile(r'^SEQ\s+(\w+)\s+(\d+)\s+(\d+)\s+(\d+)', re.MULTILINE)
p.findall(s)
produces:
[('homo_sapiens', '1', '11388669', '11532963'),
('pan_troglodytes', '1', '11517444', '11668750'),
('gorilla_gorilla', '1', '11607412', '11751006'),
('pongo_pygmaeus', '1', '218866021', '219020464'),
('macaca_mulatta', '1', '14425463', '14569832'),
('callithrix_jacchus', '7', '45949850', '46115230')]
You will then need to post-process this data to deal with the specific conditions in your code, but the overall result may be faster.
Your code looks good, but there are particular things that could be improved, such as the use of map, etc.
For good guide on performance tips in Python see:
https://wiki.python.org/moin/PythonSpeed/PerformanceTips
I have used the above tips to get code working nearly as fast as C code. Basically, try to avoid for loops (use map), try to use built-in functions, etc. Make Python work for you as much as possible by using its built-in functions, which are largely written in C.
Once you get acceptable performance you can run in parallel using:
https://docs.python.org/dev/library/multiprocessing.html#module-multiprocessing
Edit:
I also just realized you are opening a compressed gzip file. I suspect a significant amount of time is spent decompressing it. You can try to make this faster by multi-threading it with:
https://code.google.com/p/threadzip/
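As an alternative that needs no extra dependency (my own suggestion, not part of the linked project), you can push decompression into a separate OS process by piping the file through zcat and reading its stdout; this assumes a Unix-like system with zcat on the PATH:

import subprocess

def open_gzip_via_subprocess(path):
    # decompression runs in the zcat process, so the Python process only
    # spends its time parsing the already-decompressed lines
    proc = subprocess.Popen(['zcat', path], stdout=subprocess.PIPE, bufsize=-1)
    return proc.stdout  # note: yields bytes lines under Python 3

f = open_gzip_via_subprocess('Compara.6_primates_EPO.chr1_1.emf.gz')
for line in f:
    pass  # parse the line as before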
You can combine re with some fancy zipping in list comprehensions that can replace the for loops and try to squeeze some performance gains. Below I outline a strategy for segmenting the data file read in as an entire string:
import re
from itertools import izip #(if you are using py2x like me, otherwise just use zip for py3x)
s = open('test.txt').read()
Now find all header lines, and the corresponding index ranges in the large string
head_info = [(s[m.start():m.end()],m.start(), m.end()) for m in re.finditer('\nSEQ.*', s)]
head = [ h[0] for h in head_info]
head_inds = [ (h[1],h[2]) for h in head_info]
#head
#['\nSEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)',
# '\nSEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)',
# '\nSEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)',
# '\nSEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)',
# '\nSEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)',
# '\nSEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)',
# '\nSEQ homo_sapiens 1 11345717 11361846 1 (chr_length=249250621)',
#...
#head_inds
#[(107, 169),
# (169, 234),
# (234, 299),
# (299, 366),
# (366, 430),
# (430, 498),
# (1035, 1097),
# (1097, 1162)
# ...
Now, do the same for the data (lines of code with bases)
data_info = [(s[m.start():m.end()],m.start(), m.end()) for m in re.finditer('\n[AGCT-]+.*', s)]
data = [ d[0] for d in data_info]
data_inds = [ (d[1],d[2]) for d in data_info]
Now, whenever there is a new segment, there will be a discontinuity between head_inds[i][1] and head_inds[i+1][0]. Same for data_inds. We can use this knowledge to find the beginning and end of each segment as follows
head_seg_pos = [ idx+1 for idx,(i,j) in enumerate( izip( head_inds[:-1], head_inds[1:])) if j[0]-i[1]]
head_seg_pos = [0] + head_seg_pos + [len(head_inds)] # add beginning and end which we will use next
head_segmented = [ head[s1:s2] for s1,s2 in izip( head_seg_pos[:-1], head_seg_pos[1:]) ]
#[['\nSEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)',
# '\nSEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)',
# '\nSEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)',
# '\nSEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)',
# '\nSEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)',
# '\nSEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)'],
#['\nSEQ homo_sapiens 1 11345717 11361846 1 (chr_length=249250621)',
# '\nSEQ pan_troglodytes 1 11474525 11490638 1 (chr_length=229974691)',
# ...
and the same for the data
data_seg_pos = [ idx+1 for idx,(i,j) in enumerate( izip( data_inds[:-1], data_inds[1:])) if j[0]-i[1]]
data_seg_pos = [0] + data_seg_pos + [len(data_inds)] # add beginning and end for the next step
data_segmented = [ data[s1:s2] for s1,s2 in izip( data_seg_pos[:-1], data_seg_pos[1:]) ]
Now we can group the segmented data and segmented headers, and only keep groups with data on homo_sapiens and macaca_mulatta
groups = [ [h,d] for h,d in izip( head_segmented, data_segmented) if all( [sp in ''.join(h) for sp in ('homo_sapiens','macaca_mulatta')] ) ]
Now you have a groups array, where each group has
groups[0][0] #headers for segment 0
#['\nSEQ homo_sapiens 1 11388669 11532963 1 (chr_length=249250621)',
# '\nSEQ pan_troglodytes 1 11517444 11668750 1 (chr_length=229974691)',
# '\nSEQ gorilla_gorilla 1 11607412 11751006 1 (chr_length=229966203)',
# '\nSEQ pongo_pygmaeus 1 218866021 219020464 -1 (chr_length=229942017)',
# '\nSEQ macaca_mulatta 1 14425463 14569832 1 (chr_length=228252215)',
# '\nSEQ callithrix_jacchus 7 45949850 46115230 1 (chr_length=155834243)']
groups[0][1] # data from segment 0
#['\nGGGGGG',
# '\nCCCCTC',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# '\nGGGGGG',
# ...
The next step in the processing I will leave up to you, so I don't steal all the fun. But hopefully this gives you a good idea on using list comprehension to optimize code.
Update
Consider the simple test case to gauge efficiency of the comprehensions combined with re:
def test1():
    with open('test.txt','r') as f:
        head = []
        for line in f:
            if line.startswith('SEQ'):
                head.append(line)
    return head

def test2():
    s = open('test.txt').read()
    head = re.findall('\nSEQ.*', s)
    return head
%timeit( test1() )
10000 loops, best of 3: 78 µs per loop
%timeit( test2() )
10000 loops, best of 3: 37.1 µs per loop
Even if I gather additional information using re
def test3():
    s = open('test.txt').read()
    head_info = [(s[m.start():m.end()], m.start(), m.end()) for m in re.finditer('\nSEQ.*', s)]
    head = [h[0] for h in head_info]
    head_inds = [(h[1], h[2]) for h in head_info]
%timeit( test3() )
10000 loops, best of 3: 50.6 µs per loop
I still get speed gains. I believe using list comprehensions may be faster in your case. However, the for loop might actually beat the comprehension in the end (I take back what I said before); consider
def test1():  # similar to how you are reading in the data in your for loop above
    with open('test.txt','r') as f:
        head = []
        data = []
        species = []
        species_data = False
        for line in f:
            if line.startswith('SEQ'):
                head.append(line)
                species.append(line.split()[1])
                continue
            if 'DATA' in line and {'homo_sapiens', 'macaca_mulatta'}.issubset(species):
                species_data = True
                continue
            if species_data and '//' not in line:
                data.append(line)
                continue
            if species_data and line.startswith('//'):
                species_data = False
                species = []
                continue
    return head, data

def test3():
    s = open('test.txt').read()
    head_info = [(s[m.start():m.end()], m.start(), m.end()) for m in re.finditer('\nSEQ.*', s)]
    head = [h[0] for h in head_info]
    head_inds = [(h[1], h[2]) for h in head_info]
    data_info = [(s[m.start():m.end()], m.start(), m.end()) for m in re.finditer('\n[AGCT-]+.*', s)]
    data = [h[0] for h in data_info]
    data_inds = [(h[1], h[2]) for h in data_info]
    return head, data
In this case, as the iterations become more complex, the traditional for loop wins
In [24]: %timeit(test1())
10000 loops, best of 3: 135 µs per loop
In [25]: %timeit(test3())
1000 loops, best of 3: 256 µs per loop
Though I can still use re.findall twice and beat the for loop:
def test4():
    s = open('test.txt').read()
    head = re.findall('\nSEQ.*', s)
    data = re.findall('\n[AGTC-]+.*', s)
    return head, data
In [37]: %timeit( test4() )
10000 loops, best of 3: 79.5 µs per loop
I guess as the processing of each iteration becomes increasingly complex, the for loop will win, though there might be a more clever way to continue on with re. I wish there was a standard way to determine when to use either.
File processing with Numpy
The data itself appears to be completely regular and can be processed easily with Numpy. The header is only a tiny part of the file and its processing speed is not very relevant. So the idea is to switch to Numpy for only the raw data and, other than that, keep the existing loops in place.
This approach works best if the number of lines in a data segment can be determined from the header. For the remainder of this answer I assume this is indeed the case. If this is not possible, the starting and ending points of data segments have to be determined with e.g. str.find or regex. This will still run at "compiled C speed" but the downside being that the file has to be looped over twice. In my opinion, if your files are only 50MB it's not a big problem to load a complete file into RAM.
E.g. put something like the following under if species_data and '//' not in line:
# Define `import numpy as np` at the top
# Determine number of rows from header data. This may need some
# tuning, if possible at all
nrows = max(ends[i]-starts[i] for i in range(len(species)))
# Sniff line length, because number of whitespace characters uncertain
fp = f.tell()
ncols = len(f.readline())
f.seek(fp)
# Load the data without loops. The file.read method can do the same,
# but with numpy.fromfile we have it in an array from the start.
data = np.fromfile(f, dtype='S1', count=nrows*ncols)
data = data.reshape(nrows, ncols)
# Process the data without Python loops. Here we leverage Numpy
# to really speed up the processing.
human = data[:,hom]
macaque = data[:,mac]
valid = np.in1d(human, bases) & np.in1d(macaque, bases)
mismatch = (human != macaque)
pos = starts[hom] + np.flatnonzero(valid & mismatch)
# Store
div_dict[(starts[hom], ends[hom], valid.sum())] = pos
# After using np.fromfile above, the file pointer _should_ be exactly
# in front of the segment termination flag
assert('//' in f.readline())
# Reset the header containers and flags
...
So the elif species_data and '//' in line: case has become redundant and the containers and flags can be reset in the same block as the above. Alternatively, you could also remove the assert('//' in f.readline()) and keep the elif species_data and '//' in line: case and reset containers and flags there.
Caveats
For relying on the file pointer to switch between processing the header and the data, there is one caveat: (in CPython) iterating a file object uses a read-ahead buffer causing the file pointer to be further down the file than you'd expect. When you would then use numpy.fromfile with that file pointer, it skips over data at the start of the segment and moreover it reads into the header of the next segment. This can be fixed by exclusively using the file.readline method. We can conveniently use it as an iterator like so:
for line in iter(f.readline, ''):
...
For determining the number of bytes to read with numpy.fromfile there is another caveat: Sometimes there is a single line termination character \n at the end of a line and other times two \r\n. The first is the convention on Linux/OSX and the latter on Windows. There is os.linesep to determine the default, but obviously for file parsing this isn't robust enough. So in the code above the length of a data line is determined by actually reading a line, checking the len, and putting back the file pointer to the start of the line.
When you encounter a data segment ('DATA' in line) and the desired species are not in it, you should be able to calculate an offset and f.seek(f.tell() + offset) to the header of the next segment. Much better than looping over data you're not even interested in!
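A minimal sketch of that skip, reusing the header-derived nrows estimate and line-length sniffing from the snippet above and the species list from the question's code (so it only applies under the same assumptions):

if 'DATA' in line and not {'homo_sapiens', 'macaca_mulatta'}.issubset(species):
    # derive the segment size the same way as above, then jump past it
    nrows = max(ends[i] - starts[i] for i in range(len(species)))
    fp = f.tell()
    ncols = len(f.readline())     # sniff the data line length
    f.seek(fp + nrows * ncols)    # skip all alignment columns of this segment
    assert('//' in f.readline())  # consume the segment terminator
    # then reset the header containers and flags, as in the other branches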

For-loop to count differences in lines with python

I have a file filled with lines like this (this is just a small bit of the file):
9 Hyphomicrobium facile Hyphomicrobiaceae
9 Hyphomicrobium facile Hyphomicrobiaceae
7 Mycobacterium kansasii Mycobacteriaceae
7 Mycobacterium gastri Mycobacteriaceae
10 Streptomyces olivaceiscleroticus Streptomycetaceae
10 Streptomyces niger Streptomycetaceae
1 Streptomyces geysiriensis Streptomycetaceae
1 Streptomyces minutiscleroticus Streptomycetaceae
0 Brucella neotomae Brucellaceae
0 Brucella melitensis Brucellaceae
2 Mycobacterium phocaicum Mycobacteriaceae
The number refers to a cluster, and then it goes 'Genus' 'Species' 'Family'.
What I want to do is write a program that will look through each line and report back to me: a list of the different genera in each cluster, and how many of each of those genera are in the cluster. So I'm interested in cluster number and the first 'word' in each line.
My trouble is that I'm not sure how to get this information. I think I need to use a for-loop, starting at lines that begin with '0'. The output would be a file that looks something like:
Cluster 0: Brucella(2) # Lists cluster, followed by genera in cluster with number, something like that.
Cluster 1: Streptomyces(2)
Cluster 2: Brucella(1)
etc.
Eventually I want to do the same kind of count with the Families in each cluster, and then Genera and Species together. Any thoughts on how to start would be greatly appreciated!
I thought this would be a fun little toy project, so I wrote a little hack to read in an input file like yours from stdin, count and format the output recursively and spit out output that looks a little like yours, but with a nested format, like so:
Cluster 0:
    Brucella(2)
        melitensis(1)
            Brucellaceae(1)
        neotomae(1)
            Brucellaceae(1)
    Streptomyces(1)
        neotomae(1)
            Brucellaceae(1)
Cluster 1:
    Streptomyces(2)
        geysiriensis(1)
            Streptomycetaceae(1)
        minutiscleroticus(1)
            Streptomycetaceae(1)
Cluster 2:
    Mycobacterium(1)
        phocaicum(1)
            Mycobacteriaceae(1)
Cluster 7:
    Mycobacterium(2)
        gastri(1)
            Mycobacteriaceae(1)
        kansasii(1)
            Mycobacteriaceae(1)
Cluster 9:
    Hyphomicrobium(2)
        facile(2)
            Hyphomicrobiaceae(2)
Cluster 10:
    Streptomyces(2)
        niger(1)
            Streptomycetaceae(1)
        olivaceiscleroticus(1)
            Streptomycetaceae(1)
I also added some junk data to my table so that I could see an extra entry in Cluster 0, separated from the other two... The idea here is that you should be able to see a top level "Cluster" entry and then nested, indented entries for genus, species, family... it shouldn't be hard to extend for deeper trees, either, I hope.
# Sys for stdio stuff
import sys
# re for the re.split -- this can go if you find another way to parse your data
import re

# A global (shame on me) for storing the data we're going to parse from stdin
data = []

# read lines from standard input until it's empty (end-of-file)
for line in sys.stdin:
    # Split lines on spaces (gobbling multiple spaces for robustness)
    # and trim whitespace off the beginning and end of input (strip)
    entry = re.split("\s+", line.strip())
    # Throw the array into my global data array, it'll look like this:
    #   [ "0", "Brucella", "melitensis", "Brucellaceae" ]
    # A lot of this code assumes that the first field is an integer, what
    # you call "cluster" in your problem description
    data.append(entry)

# Sort, first key is expected to be an integer, and we want a numerical
# sort rather than a string sort, so convert to int, then sort by
# each subsequent column. The lambda is a function that returns a tuple
# of keys we care about for each item
data.sort(key=lambda item: (int(item[0]), item[1], item[2], item[3]))

# Our recursive function -- we're basically going to treat "data" as a tree,
# even though it's not.
# parameters:
#   start - an integer telling us what line to begin working from so we needn't
#       walk the whole tree each time to figure out where we are.
#   super - An array that captures where we are in the search. This array
#       will have more elements in it as we deepen our traversal of the "tree"
#       Initially, it will be []
#       In the next ply of the tree, it will be [ '0' ]
#       Then something like [ '0', 'Brucella' ] and so on.
#   data - The global data structure -- this never mutates after the sort above,
#       I could have just used the global directly
def groupedReport(start, super, data):
    # Figure out what ply we're on in our depth-first traversal of the tree
    depth = len(super)
    # Count entries in the super class, starting from "start" index in the array:
    count = 0
    # For the few records in the data file that match our "super" exactly, we count
    # occurrences.
    if depth != 0:
        for i in range(start, len(data)):
            if (data[i][0:depth] == data[start][0:depth]):
                count = count + 1
            else:
                break  # We can stop counting as soon as a match fails,
                       # because of the way our input data is sorted
    else:
        count = len(data)

    # At depth == 1, we're reporting about clusters -- this is the only piece of
    # the algorithm that's not truly abstract, and it's only for presentation
    if (depth == 1):
        sys.stdout.write("Cluster " + super[0] + ":\n")
    elif (depth > 0):
        # Every other depth: indent with 4 spaces for every ply of depth, then
        # output the unique field we just counted, and its count
        sys.stdout.write((' ' * ((depth - 1) * 4)) +
                         data[start][depth - 1] + '(' + str(count) + ')\n')

    # Recursion: we're going to figure out a new depth and a new "super"
    # and then call ourselves again. We break out on depth == 4 because
    # of one other assumption (I lied before about the abstract thing) I'm
    # making about our input data here. This could
    # be made more robust/flexible without a lot of work
    subsuper = None
    substart = start
    for i in range(start, start + count):
        record = data[i]  # The original record from our data
        newdepth = depth + 1
        if (newdepth > 4):
            break
        # array slice creates a new copy
        newsuper = record[0:newdepth]
        if newsuper != subsuper:
            # Recursion here!
            groupedReport(substart, newsuper, data)
            # Track our new "subsuper" for subsequent comparisons
            # as we loop through matches
            subsuper = newsuper
        # Track position in the data for the next recursion, so we can start on
        # the right line
        substart = substart + 1

# First call to groupedReport starts the recursion
groupedReport(0, [], data)
If you make my Python code into a file like "classifier.py", then you can run your input.txt file (or whatever you call it) through it like so:
cat input.txt | python classifier.py
Most of the magic of the recursion, if you care, is implemented using slices of arrays and leans heavily on the ability to compare array slices, as well as the fact that I can order the input data meaningfully with my sort routine. You may want to convert your input data to all-lowercase, if it is possible that case inconsistencies could yield mismatches.
It is easy to do:
Create an empty dict {} to store your result; let's call it 'result'.
Loop over the data line by line.
Split each line on spaces to get the 4 elements of your structure: cluster, genus, species, family.
Increment the count of each genus inside its cluster key whenever it is found; it has to be set to 1 on the first occurrence, though.
result = { '0': {'Brucella': 2}, '1': {'Streptomyces': 2}, ... }
Code:
my_data = """9 Hyphomicrobium facile Hyphomicrobiaceae
9 Hyphomicrobium facile Hyphomicrobiaceae
7 Mycobacterium kansasii Mycobacteriaceae
7 Mycobacterium gastri Mycobacteriaceae
10 Streptomyces olivaceiscleroticus Streptomycetaceae
10 Streptomyces niger Streptomycetaceae
1 Streptomyces geysiriensis Streptomycetaceae
1 Streptomyces minutiscleroticus Streptomycetaceae
0 Brucella neotomae Brucellaceae
0 Brucella melitensis Brucellaceae
2 Mycobacterium phocaicum Mycobacteriaceae"""
result = {}
for line in my_data.split("\n"):
cluster,genus,species,family = line.split(" ")
result.setdefault(cluster,{}).setdefault(genus,0)
result[cluster][genus] += 1
print(result)
{'10': {'Streptomyces': 2}, '1': {'Streptomyces': 2}, '0': {'Brucella': 2}, '2': {'Mycobacterium': 1}, '7': {'Mycobacterium': 2}, '9': {'Hyphomicrobium': 2}}
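For the follow-up goal in the question (the same counts for families, or for genus and species together), the pattern extends naturally. A small sketch reusing the my_data string above, with collections.Counter doing the counting:

from collections import Counter, defaultdict

genus_counts = defaultdict(Counter)
family_counts = defaultdict(Counter)
genus_species_counts = defaultdict(Counter)

for line in my_data.split("\n"):
    cluster, genus, species, family = line.split(" ")
    genus_counts[cluster][genus] += 1
    family_counts[cluster][family] += 1
    genus_species_counts[cluster][(genus, species)] += 1

print(dict(family_counts['0']))          # {'Brucellaceae': 2}
print(dict(genus_species_counts['7']))   # {('Mycobacterium', 'kansasii'): 1, ('Mycobacterium', 'gastri'): 1}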
