I need some help getting my brain around designing an (efficient) markov chain in spark (via python). I've written it as best as I could, but the code I came up with doesn't scale.. Basically for the various map stages, I wrote custom functions and they work fine for sequences of a couple thousand, but when we get in the 20,000+ (and I've got some up to 800k) things slow to a crawl.
For those of you not familiar with markov moodels, this is the gist of it..
This is my data.. I've got the actual data (no header) in an RDD at this point.
ID, SEQ
500, HNL, LNH, MLH, HML
We look at sequences in tuples, so
(HNL, LNH), (LNH,MLH), etc..
And I need to get to this point.. where I return a dictionary (for each row of data) that I then serialize and store in an in memory database.
{500:
{HNLLNH : 0.333},
{LNHMLH : 0.333},
{MLHHML : 0.333},
{LNHHNL : 0.000},
etc..
}
So in essence, each sequence is combined with the next (HNL,LNH become 'HNLLNH'), then for all possible transitions (combinations of sequences) we count their occurrence and then divide by the total number of transitions (3 in this case) and get their frequency of occurrence.
There were 3 transitions above, and one of those was HNLLNH.. So for HNLLNH, 1/3 = 0.333
As a side not, and I'm not sure if it's relevant, but the values for each position in a sequence are limited.. 1st position (H/M/L), 2nd position (M/L), 3rd position (H,M,L).
What my code had previously done was to collect() the rdd, and map it a couple times using functions I wrote. Those functions first turned the string into a list, then merged list[1] with list[2], then list[2] with list[3], then list[3] with list[4], etc.. so I ended up with something like this..
[HNLLNH],[LNHMLH],[MHLHML], etc..
Then the next function created a dictionary out of that list, using the list item as a key and then counted the total ocurrence of that key in the full list, divided by len(list) to get the frequency. I then wrapped that dictionary in another dictionary, along with it's ID number (resulting in the 2nd code block, up a above).
Like I said, this worked well for small-ish sequences, but not so well for lists with a length of 100k+.
Also, keep in mind, this is just one row of data. I have to perform this operation on anywhere from 10-20k rows of data, with rows of data varying between lengths of 500-800,000 sequences per row.
Any suggestions on how I can write pyspark code (using the API map/reduce/agg/etc.. functions) to do this efficiently?
EDIT
Code as follows.. Probably makes sense to start at the bottom. Please keep in mind I'm learning this(Python and Spark) as I go, and I don't do this for a living, so my coding standards are not great..
def f(x):
# Custom RDD map function
# Combines two separate transactions
# into a single transition state
cust_id = x[0]
trans = ','.join(x[1])
y = trans.split(",")
s = ''
for i in range(len(y)-1):
s= s + str(y[i] + str(y[i+1]))+","
return str(cust_id+','+s[:-1])
def g(x):
# Custom RDD map function
# Calculates the transition state probabilities
# by adding up state-transition occurrences
# and dividing by total transitions
cust_id=str(x.split(",")[0])
trans = x.split(",")[1:]
temp_list=[]
middle = int((len(trans[0])+1)/2)
for i in trans:
temp_list.append( (''.join(i)[:middle], ''.join(i)[middle:]) )
state_trans = {}
for i in temp_list:
state_trans[i] = temp_list.count(i)/(len(temp_list))
my_dict = {}
my_dict[cust_id]=state_trans
return my_dict
def gen_tsm_dict_spark(lines):
# Takes RDD/string input with format CUST_ID(or)PROFILE_ID,SEQ,SEQ,SEQ....
# Returns RDD of dict with CUST_ID and tsm per customer
# i.e. {cust_id : { ('NLN', 'LNN') : 0.33, ('HPN', 'NPN') : 0.66}
# creates a tuple ([cust/profile_id], [SEQ,SEQ,SEQ])
cust_trans = lines.map(lambda s: (s.split(",")[0],s.split(",")[1:]))
with_seq = cust_trans.map(f)
full_tsm_dict = with_seq.map(g)
return full_tsm_dict
def main():
result = gen_tsm_spark(my_rdd)
# Insert into DB
for x in result.collect():
for k,v in x.iteritems():
db_insert(k,v)
You can try something like below. It depends heavily on tooolz but if you prefer to avoid external dependencies you can easily replace it with some standard Python libraries.
from __future__ import division
from collections import Counter
from itertools import product
from toolz.curried import sliding_window, map, pipe, concat
from toolz.dicttoolz import merge
# Generate all possible transitions
defaults = sc.broadcast(dict(map(
lambda x: ("".join(concat(x)), 0.0),
product(product("HNL", "NL", "HNL"), repeat=2))))
rdd = sc.parallelize(["500, HNL, LNH, NLH, HNL", "600, HNN, NNN, NNN, HNN, LNH"])
def process(line):
"""
>>> process("000, HHH, LLL, NNN")
('000', {'LLLNNN': 0.5, 'HHHLLL': 0.5})
"""
bits = line.split(", ")
transactions = bits[1:]
n = len(transactions) - 1
frequencies = pipe(
sliding_window(2, transactions), # Get all transitions
map(lambda p: "".join(p)), # Joins strings
Counter, # Count
lambda cnt: {k: v / n for (k, v) in cnt.items()} # Get frequencies
)
return bits[0], frequencies
def store_partition(iter):
for (k, v) in iter:
db_insert(k, merge([defaults.value, v]))
rdd.map(process).foreachPartition(store_partition)
Since you know all possible transitions I would recommend using a sparse representation and ignore zeros. Moreover you can replace dictionaries with sparse vectors to reduce memory footprint.
you can achieve this result by using pure Pyspark, i did using it using pyspark.
To create frequencies, let say you have already achieved and these are input RDDs
ID, SEQ
500, [HNL, LNH, MLH, HML ...]
and to get frequencies like, (HNL, LNH),(LNH, MLH)....
inputRDD..map(lambda (k, list): get_frequencies(list)).flatMap(lambda x: x) \
.reduceByKey(lambda v1,v2: v1 +v2)
get_frequencies(states_list):
"""
:param states_list: Its a list of Customer States.
:return: State Frequencies List.
"""
rest = []
tuples_list = []
for idx in range(0,len(states_list)):
if idx + 1 < len(states_list):
tuples_list.append((states_list[idx],states_list[idx+1]))
unique = set(tuples_list)
for value in unique:
rest.append((value, tuples_list.count(value)))
return rest
and you will get results
((HNL, LNH), 98),((LNH, MLH), 458),() ......
after this you may convert result RDDs into Dataframes or yu can directly insert into DB using RDDs mapPartitions
Related
I have a csv file with roughly 50K rows of search engine queries. Some of the search queries are the same, just in a different word order, for example "query A this is " and "this is query A".
I've tested using fuzzywuzzy's token_sort_ratio function to find matching word order queries, which works well, however I'm struggling with the runtime of the nested loop, and looking for optimisation tips.
Currently the nested for loops take around 60 hours to run on my machine. Does anyone know how I might speed this up?
Code below:
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
from tqdm import tqdm
filePath = '/content/queries.csv'
df = pd.read_csv(filePath)
table1 = df['keyword'].to_list()
table2 = df['keyword'].to_list()
data = []
for kw_t1 in tqdm(table1):
for kw_t2 in table2:
score = fuzz.token_sort_ratio(kw_t1,kw_t2)
if score == 100 and kw_t1 != kw_t2:
data +=[[kw_t1, kw_t2, score]]
data_df = pd.DataFrame(data, columns=['query', 'queryComparison', 'score'])
Any advice would be appreciated.
Thanks!
Since what you are looking for are strings consisting of identical words (just not necessarily in the same order), there is no need to use fuzzy matching at all. You can instead use collections.Counter to create a frequency dict for each string, with which you can map the strings under a dict of lists keyed by their frequency dicts. You can then output sub-lists in the dicts whose lengths are greater than 1.
Since dicts are not hashable, you can make them keys of a dict by converting them to frozensets of tuples of key-value pairs first.
This improves the time complexity from O(n ^ 2) of your code to O(n) while also avoiding overhead of performing fuzzy matching.
from collections import Counter
matches = {}
for query in df['keyword']:
matches.setdefault(frozenset(Counter(query.split()).items()), []).append(query)
data = [match for match in matches.values() if len(match) > 1]
Demo: https://replit.com/#blhsing/WiseAfraidBrackets
I don't think you need fuzzywuzzy here: you are just checking for equality (score == 100) of the sorted queries, but with token_sort_ratio you are sorting the queries over and over. So I suggest to:
create a "base" list and a "sorted-elements" one
iterate on the elements.
This will still be O(n^2), but you will be sorting 50_000 strings instead of 2_500_000_000!
filePath = '/content/queries.csv'
df = pd.read_csv(filePath)
table_base = df['keyword'].to_list()
table_sorted = [sorted(kw) for kw in table_base]
data = []
ln = len(table_base)
for i in range(ln-1):
for j in range(i+1,ln):
if table_sorted[i] == table_sorted[j]:
data +=[[table_base[i], table_base[j], 100]]
data_df = pd.DataFrame(data, columns=['query', 'queryComparison', 'score'])
Apply in pandas as usually works faster:
kw_t2 = df['keyword'].to_list()
def compare(kw_t1):
found_duplicates = []
score = fuzz.token_sort_ratio(kw_t1, kw_t2)
if score == 100 and kw_t1 != kw_t2:
found_duplicates.append(kw_t2)
return found_duplicates
df["duplicates"] = df['keyword'].apply(compare)
What should I add or change to my code below in order to get a function that finds the mean length of reads??? I have to write a function, mean_length, that takes one argument: A dictionary, in which keys are read names and values are read sequences. The function must return a float, which is the average length of the sequence reads.
Hope someone can help me :D
I am very new to coding in python.
read_map = {'Read1': 'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC',
'Read3': 'GTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGTCGTGAACACATCAGT',
'Read2': 'CTTTACCCGGAAGAGCGGGACGCTGCCCTGCGCGATTCCAGGCTCCCCACGGG',
'Read5': 'CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC',
'Read4': 'TGCGAGGGAAGTGAAGTATTTGACCCTTTACCCGGAAGAGCG',
'Read6': 'TGACAGTAGATCTCGTCCAGACCCCTAGCTGGTACGTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGT'}
def mean_lenght (read_map):
print('keys : ',read_map.values())
for key in read_map.keys():
print(key)
#result = sum(...?)/len(read_map)
return result
print(mean_lenght(read_map))
A very one-line simple solution could be:
def mean_length(read_map):
return sum([len(v) for v in read_map.values()]) / len(read_map)
Basically, you construct a list of elements, each storing the length of an entry of read_map. Then, you sum up all those lengths and divide by the number of entries in your dict.
If your dictionary is very big, then constructing a list might not be the most memory efficient way. In this case:
def mean_length(read_map):
mean = 0
for v in read_map.values(): mean += len(v)
mean /= len(read_map)
return mean
In this way, you do not build any intermediate list.
The mean is the sum of lengths divided by the number of values, so let's just do this:
sum(map(len, read_map.values()))/len(read_map)
output: 55.333
Breakdown:
# "…" denotes the output of the previous line
read_map.values() -> returns the values of the dictionary
map(len, …) -> computes the length of each sequence
sum(…) -> get the total length
sum(…)/len(read_map) -> divide the total length by the number of sequence = mean
As a function:
def mean_length(d):
return sum(map(len, d.values()))/len(d)
>>> mean_length(read_map)
53.33
get a function that finds the mean length of reads???
python built-in module statistics has what you are looking for statistics.mean. Naturally you need to find lengths before feeding data in said function, for which len built-in function is useful.
import statistics
read_map = {'Read1': 'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC',
'Read3': 'GTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGTCGTGAACACATCAGT',
'Read2': 'CTTTACCCGGAAGAGCGGGACGCTGCCCTGCGCGATTCCAGGCTCCCCACGGG',
'Read5': 'CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC',
'Read4': 'TGCGAGGGAAGTGAAGTATTTGACCCTTTACCCGGAAGAGCG',
'Read6': 'TGACAGTAGATCTCGTCCAGACCCCTAGCTGGTACGTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGT'}
print(statistics.mean(len(v) for v in read_map.values()))
output
55.333333333333336
def mean_length(read_map):
total_chars = 0
for key in read_map.values():
total_chars = total_chars + len(key)
result = total_chars / len(read_map)
return result
I think this is the most intuitive code for beginner.
I am looking for a memory-efficient way to create a column in a pyspark DF using a function that is taking as arguments values from 'adjacent' (i.e., when the dataset has been sorted on a particular column) records, which are permutations of a particular hash.
That is, for a particular dataframe, which has several 'hash_n' (up to ~five) columns, I am looking to sort on each hash column, and create a new column based on a function of the hash in 'this' column, as well as the next several (up to ~15) columns. The function is essentially comparing the 'similarity' of the hashes and appending the 'index' of the 'other' column if the similarity is above some threshold.
Initially I was doing this with a window function and a pyspark UDF, but having run into outOfMemory problems, I am now converting to RDD, creating a dictionary of 'nearby' RDDs by incrementing the index, unioning the resulting dictionary values, and applying the function to the union via reduceByKey. This approach seems to be an improvement, although I am still running into memory issues (yarn killing containers as result of executor using too much memory; have tried several settings without being able to get around this problem).
Here is the relevant part of the code I am using (it is slightly more convoluted as it has to allow for the possibility of a single column on which to partition (in terms of how the data is processed -- we only compare hashes if they are in the same partition) the data, 'partnCol', or else a list of the same, 'partnCols'; in practice we always have at least one partition column); the parameters here are nPerms, the number of times that the hash is permuted (so that we have a total of nPerms+1 hash columns), and B, the number of adjacent records to look at for matches.
HSHTBLDict = {}; rdd0Dict = {}; rddDictDict = {}
possMtchsDict = {}; mtchsDict = {}
for nn in range(nPerms+1):
if (partnCol):
HSHTBLDict[nn] = _addNamedDFIndex(HSHTBL.orderBy\
(partnCol, 'hash_{0}'.format(nn)), 'newIdx')
rdd0Dict[nn] = HSHTBLDict[nn].select('newIdx', partnCol, '__index__', \
'hash_{0}'.format(nn))\
.orderBy(partnCol, 'hash_{0}'.format(nn)).rdd.map(tuple)\
.map(lambda kv: ((kv[0], kv[1]), (kv[2:])))
elif (partnCols):
HSHTBLDict[nn] = _addNamedDFIndex(HSHTBL.orderBy(\
*(partnCols+['hash_{0}'.format(nn)])), 'newIdx')
rdd0Dict[nn] = HSHTBLDict[nn].select('newIdx', \
*(partnCols+['__index__', 'hash_{0}'.format(nn)]))\
.orderBy(*(partnCols+['hash_{0}'.format(nn)])).rdd.map(tuple)\
.map(lambda kv: ((kv[:len(partnCols)+1]), \
(kv[len(partnCols)+1:])))
else:
HSHTBLDict[nn] = _addNamedDFIndex(HSHTBL.orderBy(\
'hash_{0}'.format(nn)), 'newIdx')
rdd0Dict[nn] = HSHTBLDict[nn].select('newIdx', '__index__', \
'hash_{0}'.format(nn))\
.orderBy('hash_{0}'.format(nn)).rdd.map(tuple)\
.map(lambda kv: ((kv[0], ), (kv[1:])))
rddDictDict[nn] = {}
for b in range(1, B+1):
def funcPos(r, b=b):
return r.map(lambda kv: (tuple([kv[0][x] + b if x==0 else kv[0][x] \
for x in range(len(kv[0]))]), kv[1]))
rddDictDict[nn][b] = funcPos(rdd0Dict[nn])
possMtchsDict[nn] = sc.union([rdd0Dict[nn]] + rddDictDict[nn].values())\
.reduceByKey(lambda x,y: x+y, numPartitions=rddParts)\
.mapValues(lambda v: tuple(v[i:i+2] \
for i in range(0, len(v), 2)))
mtchsDict[nn] = possMtchsDict[nn].mapValues(lambda v: tuple([ tuple([ v[x][0], \
[p[0] for p in v if _hashSim(v[x][1], p[1]) > thr]])\
for x in range(len(v)) ]) )
# union together all combinations from all hash columns
Mtchs = sc.union(mtchsDict.values())
# map to rdd of (__index__, matchList) pairs
mls = Mtchs.flatMap(lambda kv: kv[1]).reduceByKey(lambda x,y: list(set(x+y)), \
numPartitions=rddParts)
If anyone has any advice / ideas would be very happy to listen to them.
I did try reducing the number of cores per executor, and I found that although the process was (comparatively very) slow I could get it to complete, although I have not yet found a combination of settings that work for the full range of nPerms / B that I would like to achieve. Possibly it is necessary to re-work the code / approach in order to do that. I would also like to understand better the way in which the amount of memory that yarn is allowed for things like shuffles is determined. I have found that decreasing --executor-memory and increasing --conf spark.yarn.executor.memoryOverhead seems to help, but it still not clear to me how one would calculate the amount of memory that will be available.
I'm currently working with DNA sequence data and I have run into a bit of a performance roadblock.
I have two lookup dictionaries/hashes (as RDDs) with DNA "words" (short sequences) as keys and a list of index positions as the value. One is for a shorter query sequence and the other for a database sequence. Creating the tables is pretty fast even with very, very large sequences.
For the next step, I need to pair these up and find "hits" (pairs of index positions for each common word).
I first join the lookup dictionaries, which is reasonably fast. However, I now need the pairs, so I have to flatmap twice, once to expand the list of indices from the query and the second time to expand the list of indices from the database. This isn't ideal, but I don't see another way to do it. At least it performs ok.
The output at this point is: (query_index, (word_length, diagonal_offset)), where the diagonal offset is the database_sequence_index minus the query sequence index.
However, I now need to find pairs of indices on with the same diagonal offset (db_index - query_index) and reasonably close together and join them (so I increase the length of the word), but only as pairs (i.e. once I join one index with another, I don't want anything else to merge with it).
I do this with an aggregateByKey operation using a special object called Seed().
PARALELLISM = 16 # I have 4 cores with hyperthreading
def generateHsps(query_lookup_table_rdd, database_lookup_table_rdd):
global broadcastSequences
def mergeValueOp(seedlist, (query_index, seed_length)):
return seedlist.addSeed((query_index, seed_length))
def mergeSeedListsOp(seedlist1, seedlist2):
return seedlist1.mergeSeedListIntoSelf(seedlist2)
hits_rdd = (query_lookup_table_rdd.join(database_lookup_table_rdd)
.flatMap(lambda (word, (query_indices, db_indices)): [(query_index, db_indices) for query_index in query_indices], preservesPartitioning=True)
.flatMap(lambda (query_index, db_indices): [(db_index - query_index, (query_index, WORD_SIZE)) for db_index in db_indices], preservesPartitioning=True)
.aggregateByKey(SeedList(), mergeValueOp, mergeSeedListsOp, PARALLELISM)
.map(lambda (diagonal, seedlist): (diagonal, seedlist.mergedSeedList))
.flatMap(lambda (diagonal, seedlist): [(query_index, seed_length, diagonal) for query_index, seed_length in seedlist])
)
return hits_rdd
Seed():
class SeedList():
def __init__(self):
self.unmergedSeedList = []
self.mergedSeedList = []
#Try to find a more efficient way to do this
def addSeed(self, (query_index1, seed_length1)):
for i in range(0, len(self.unmergedSeedList)):
(query_index2, seed_length2) = self.unmergedSeedList[i]
#print "Checking ({0}, {1})".format(query_index2, seed_length2)
if min(abs(query_index2 + seed_length2 - query_index1), abs(query_index1 + seed_length1 - query_index2)) <= WINDOW_SIZE:
self.mergedSeedList.append((min(query_index1, query_index2), max(query_index1+seed_length1, query_index2+seed_length2)-min(query_index1, query_index2)))
self.unmergedSeedList.pop(i)
return self
self.unmergedSeedList.append((query_index1, seed_length1))
return self
def mergeSeedListIntoSelf(self, seedlist2):
print "merging seed"
for (query_index2, seed_length2) in seedlist2.unmergedSeedList:
wasmerged = False
for i in range(0, len(self.unmergedSeedList)):
(query_index1, seed_length1) = self.unmergedSeedList[i]
if min(abs(query_index2 + seed_length2 - query_index1), abs(query_index1 + seed_length1 - query_index2)) <= WINDOW_SIZE:
self.mergedSeedList.append((min(query_index1, query_index2), max(query_index1+seed_length1, query_index2+seed_length2)-min(query_index1, query_index2)))
self.unmergedSeedList.pop(i)
wasmerged = True
break
if not wasmerged:
self.unmergedSeedList.append((query_index2, seed_length2))
return self
This is where the performance really breaks down for even sequences of moderate length.
Is there any better way to do this aggregation? My gut feeling says yes, but I can't come up with it.
I know this is a very long winded and technical question, and I would really appreciate any insight even if there is no easy solution.
Edit: Here is how I am making the lookup tables:
def createLookupTable(sequence_rdd, sequence_name, word_length):
global broadcastSequences
blank_list = []
def addItemToList(lst, val):
lst.append(val)
return lst
def mergeLists(lst1, lst2):
#print "Merging"
return lst1+lst2
return (sequence_rdd
.flatMap(lambda seq_len: range(0, seq_len - word_length + 1))
.repartition(PARALLELISM)
#.partitionBy(PARALLELISM)
.map(lambda index: (str(broadcastSequences.value[sequence_name][index:index + word_length]), index), preservesPartitioning=True)
.aggregateByKey(blank_list, addItemToList, mergeLists, PARALLELISM))
#.map(lambda (word, indices): (word, sorted(indices))))
And here is the function that runs the whole operation:
def run(query_seq, database_sequence, translate_query=False):
global broadcastSequences
scoring_matrix = 'nucleotide' if isinstance(query_seq.alphabet, DNAAlphabet) else 'blosum62'
sequences = {'query': query_seq,
'database': database_sequence}
broadcastSequences = sc.broadcast(sequences)
query_rdd = sc.parallelize([len(query_seq)])
query_rdd.cache()
database_rdd = sc.parallelize([len(database_sequence)])
database_rdd.cache()
query_lookup_table_rdd = createLookupTable(query_rdd, 'query', WORD_SIZE)
query_lookup_table_rdd.cache()
database_lookup_table_rdd = createLookupTable(database_rdd, 'database', WORD_SIZE)
seeds_rdd = generateHsps(query_lookup_table_rdd, database_lookup_table_rdd)
return seeds_rdd
Edit 2: I have tweaked things a bit and slightly improved performance by replacing:
.flatMap(lambda (word, (query_indices, db_indices)): [(query_index, db_indices) for query_index in query_indices], preservesPartitioning=True)
.flatMap(lambda (query_index, db_indices): [(db_index - query_index, (query_index, WORD_SIZE)) for db_index in db_indices], preservesPartitioning=True)
in hits_rdd with:
.flatMap(lambda (word, (query_indices, db_indices)): itertools.product(query_indices, db_indices))
.map(lambda (query_index, db_index): (db_index - query_index, (query_index, WORD_SIZE) ))
At least now I'm not burning up storage with intermediate data structures as much.
Let's forget about the technical details of what your doing and think "functionally" about the steps involved, forgetting about the details of the implementation. Functional thinking like this is an important part of parallel data analysis; ideally if we can break the problem up like this, we can reason more clearly about the steps involved, and end up with clearer and often more concise. Thinking in terms of a tabular data model, I would consider your problem to consist of the following steps:
Join your two datasets on the sequence column.
Create a new column delta containing the difference between the indices.
Sort by (either) index to make sure that the subsequences are in the correct order.
Group by delta and concatenate the strings in the sequence column, to obtain the full matches between your datasets.
For the first 3 steps, I think it makes sense to use DataFrames, since this data model makes sense in my head of the kind processing that we're doing. (Actually I might use DataFrames for step 4 as well, except pyspark doesn't currently support custom aggregate functions for DataFrames, although Scala does).
For the fourth step (which is if I understand correctly what you're really asking about in your question), it's a little tricky to think about how to do this functionally, however I think an elegant and efficient solution is to use a reduce (also known a right fold); this pattern can be applied to any problem that you can phrase in terms of iteratively applying an associative binary function, that is a function where the "grouping" of any 3 arguments doesn't matter (although the order certainly may matter), Symbolically, this is a function x,y -> f(x,y) where f(x, f(y, z)) = f(f(x, y), z). String (or more generally list) concatenation is just such a function.
Here's an example of how this might look in pyspark; hopefully you can adapt this to the details of your data:
#setup some sample data
query = [['abcd', 30] ,['adab', 34] ,['dbab',38]]
reference = [['dbab', 20], ['ccdd', 24], ['abcd', 50], ['adab',54], ['dbab',58], ['dbab', 62]]
#create data frames
query_df = sqlContext.createDataFrame(query, schema = ['sequence1', 'index1'])
reference_df = sqlContext.createDataFrame(reference, schema = ['sequence2', 'index2'])
#step 1: join
matches = query_df.join(reference_df, query_df.sequence1 == reference_df.sequence2)
#step 2: calculate delta column
matches_delta = matches.withColumn('delta', matches.index2 - matches.index1)
#step 3: sort by index
matches_sorted = matches_delta.sort('delta').sort('index2')
#step 4: convert back to rdd and reduce
#note that + is just string concatenation for strings
r = matches_sorted['delta', 'sequence1'].rdd
r.reduceByKey(lambda x, y : x + y).collect()
#expected output:
#[(24, u'dbab'), (-18, u'dbab'), (20, u'abcdadabdbab')]
I have a csv file with a single column, but 6.2 million rows, all containing strings between 6 and 20ish letters. Some strings will be found in duplicate (or more) entries, and I want to write these to a new csv file - a guess is that there should be around 1 million non-unique strings. That's it, really. Continuously searching through a dictionary of 6 million entries does take its time, however, and I'd appreciate any tips on how to do it. Any script I've written so far takes at least a week (!) to run, according to some timings I did.
First try:
in_file_1 = open('UniProt Trypsinome (full).csv','r')
in_list_1 = list(csv.reader(in_file_1))
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv','w+')
out_file_2 = open('UniProt Unique Trypsin Peptides.csv','w+')
writer_1 = csv.writer(out_file_1)
writer_2 = csv.writer(out_file_2)
# Create trypsinome dictionary construct
ref_dict = {}
for row in range(len(in_list_1)):
ref_dict[row] = in_list_1[row]
# Find unique/non-unique peptides from trypsinome
Peptide_list = []
Uniques = []
for n in range(len(in_list_1)):
Peptide = ref_dict.pop(n)
if Peptide in ref_dict.values(): # Non-unique peptides
Peptide_list.append(Peptide)
else:
Uniques.append(Peptide) # Unique peptides
for m in range(len(Peptide_list)):
Write_list = (str(Peptide_list[m]).replace("'","").replace("[",'').replace("]",''),'')
writer_1.writerow(Write_list)
Second try:
in_file_1 = open('UniProt Trypsinome (full).csv','r')
in_list_1 = list(csv.reader(in_file_1))
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv','w+')
writer_1 = csv.writer(out_file_1)
ref_dict = {}
for row in range(len(in_list_1)):
Peptide = in_list_1[row]
if Peptide in ref_dict.values():
write = (in_list_1[row],'')
writer_1.writerow(write)
else:
ref_dict[row] = in_list_1[row]
EDIT: here's a few lines from the csv file:
SELVQK
AKLAEQAER
AKLAEQAERR
LAEQAER
LAEQAERYDDMAAAMK
LAEQAERYDDMAAAMKK
MTMDKSELVQK
YDDMAAAMKAVTEQGHELSNEER
YDDMAAAMKAVTEQGHELSNEERR
Do it with Numpy. Roughly:
import numpy as np
column = 42
mat = np.loadtxt("thefile", dtype=[TODO])
uniq = set(np.unique(mat[:,column]))
for row in mat:
if row[column] not in uniq:
print row
You could even vectorize the output stage using numpy.savetxt and the char-array operators, but it probably won't make very much difference.
First hint : Python has support for lazy evaluation, better to use it when dealing with huge datasets. So :
iterate over your csv.reader instead of building a huge in-memory list,
don't build huge in-memory lists with ranges - use enumerate(seq) instead if you need both the item and index, and just iterate over your sequence's items if you don't need the index.
Second hint : the main point of using a dict (hashtable) is to lookup on keys, not values... So don't build a huge dict that's used as a list.
Third hint : if you just want a way to store "already seen" values, use a Set.
I'm not so good in Python, so I don't know how the 'in' works, but your algorithm seems to run in n².
Try to sort your list after reading it, with an algo in n log(n), like quicksort, it should work better.
Once the list is ordered, you just have to check if two consecutive elements of the list are the same.
So you get the reading in n, the sorting in n log(n) (at best), and the comparison in n.
Although I think that the numpy solution is the best, I'm curious whether we can speed up the given example. My suggestions are:
skip csv.reader costs and just read the line
rb to skip the extra scan needed to fix newlines
use bigger file buffer sizes (read 1Meg, write 64K is probably good)
use the dict keys as an index - key lookup is much faster than value lookup
I'm not a numpy guy, so I'd do something like
in_file_1 = open('UniProt Trypsinome (full).csv','rb', 1048576)
out_file_1 = open('UniProt Non-Unique Reference Trypsinome.csv','w+', 65536)
ref_dict = {}
for line in in_file_1:
peptide = line.rstrip()
if peptide in ref_dict:
out_file_1.write(peptide + '\n')
else:
ref_dict[peptide] = None