I'm trying to create a little load-balancing function that takes in a list of ordered numbers (these numbers will be string lengths) and outputs load-balanced chunks. The idea is that we start a chunk with index 0 (smallest string length) and then add to it index -1 (longest string length). We repeat this until we run out of string lengths (stored in list_ordered), so that each chunk has the desired chunk_size.
Anyway, the function below works fine but is not exactly scalable, since we are storing all the data in the list of lists res. My question is: taking into account what I want and the code below, could you please help me convert this function into a generator?
Thanks!
from math import ceil

def chunk_generator_load_balanced(list_ordered, chunk_size):
    n_chunks = ceil(len(list_ordered) / chunk_size)
    res = []
    direction_chunks = {}
    for i in range(n_chunks):
        res.append([])
        direction_chunks[i] = True
    chunk_index = 0
    while list_ordered:
        if direction_chunks[chunk_index]:
            chunk_val = list_ordered.pop(0)
            direction_chunks[chunk_index] = False
        else:
            chunk_val = list_ordered.pop(-1)
            direction_chunks[chunk_index] = True
        res[chunk_index].append(chunk_val)
        if chunk_index == n_chunks - 1:
            chunk_index = 0
        else:
            chunk_index += 1
    return res
if __name__ == '__main__':
    list_keys = [i for i in range(50)]
    a = chunk_generator_load_balanced(list_keys, 10)
Why not just use yield instead of return in the chunk_generator_load_balanced function?
I mean this:
def chunk_generator_load_balanced(list_ordered, chunk_size):
    n_chunks = ceil(len(list_ordered) / chunk_size)
    res = []
    direction_chunks = {}
    for i in range(n_chunks):
        res.append([])
        direction_chunks[i] = True
    chunk_index = 0
    while list_ordered:
        if direction_chunks[chunk_index]:
            chunk_val = list_ordered.pop(0)
            direction_chunks[chunk_index] = False
        else:
            chunk_val = list_ordered.pop(-1)
            direction_chunks[chunk_index] = True
        res[chunk_index].append(chunk_val)
        if chunk_index == n_chunks - 1:
            chunk_index = 0
        else:
            chunk_index += 1
    yield res
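One caveat: with yield placed after the while loop, the function yields the complete res exactly once, so the full list of lists is still built in memory. If the point is to avoid materialising res at all, here is a sketch of a fully lazy variant (my own reworking, only lightly tested): it computes, for each chunk, the indices that the original front/back round-robin would have assigned to it, and yields one chunk at a time. Unlike the original, it does not consume list_ordered.

from math import ceil

def chunk_generator_load_balanced(list_ordered, chunk_size):
    n = len(list_ordered)
    n_chunks = ceil(n / chunk_size)
    for i in range(n_chunks):
        chunk = []
        r = 0                                    # round number of the original while loop
        while r * n_chunks + i < n:              # this chunk still receives an item in round r
            if r % 2 == 0:                       # even rounds take from the front
                idx = (r // 2) * n_chunks + i
            else:                                # odd rounds take from the back
                idx = n - 1 - (r // 2) * n_chunks - i
            chunk.append(list_ordered[idx])
            r += 1
        yield chunk

Used as for chunk in chunk_generator_load_balanced(list(range(50)), 10): ..., it hands out one chunk at a time; on the small inputs I tried, the chunk contents match the eager version's res.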
What should I add or change in my code below in order to get a function that finds the mean length of reads? I have to write a function, mean_length, that takes one argument: a dictionary in which keys are read names and values are read sequences. The function must return a float, which is the average length of the sequence reads.
Hope someone can help me :D
I am very new to coding in Python.
read_map = {'Read1': 'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC',
'Read3': 'GTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGTCGTGAACACATCAGT',
'Read2': 'CTTTACCCGGAAGAGCGGGACGCTGCCCTGCGCGATTCCAGGCTCCCCACGGG',
'Read5': 'CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC',
'Read4': 'TGCGAGGGAAGTGAAGTATTTGACCCTTTACCCGGAAGAGCG',
'Read6': 'TGACAGTAGATCTCGTCCAGACCCCTAGCTGGTACGTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGT'}
def mean_lenght(read_map):
    print('keys : ', read_map.values())
    for key in read_map.keys():
        print(key)
    #result = sum(...?)/len(read_map)
    return result

print(mean_lenght(read_map))
A very simple one-line solution could be:
def mean_length(read_map):
    return sum([len(v) for v in read_map.values()]) / len(read_map)
Basically, you construct a list of elements, each storing the length of an entry of read_map. Then, you sum up all those lengths and divide by the number of entries in your dict.
If your dictionary is very big, then constructing a list might not be the most memory efficient way. In this case:
def mean_length(read_map):
    mean = 0
    for v in read_map.values():
        mean += len(v)
    mean /= len(read_map)
    return mean
In this way, you do not build any intermediate list.
The mean is the sum of lengths divided by the number of values, so let's just do this:
sum(map(len, read_map.values()))/len(read_map)
output: 55.333
Breakdown:
# "…" denotes the output of the previous line
read_map.values() -> returns the values of the dictionary
map(len, …) -> computes the length of each sequence
sum(…) -> get the total length
sum(…)/len(read_map) -> divide the total length by the number of sequences = mean
As a function:
def mean_length(d):
    return sum(map(len, d.values()))/len(d)

>>> mean_length(read_map)
55.333333333333336
get a function that finds the mean length of reads???
The Python built-in module statistics has what you are looking for: statistics.mean. Naturally, you need to compute the lengths before feeding the data into that function, for which the len built-in is useful.
import statistics
read_map = {'Read1': 'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC',
'Read3': 'GTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGTCGTGAACACATCAGT',
'Read2': 'CTTTACCCGGAAGAGCGGGACGCTGCCCTGCGCGATTCCAGGCTCCCCACGGG',
'Read5': 'CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC',
'Read4': 'TGCGAGGGAAGTGAAGTATTTGACCCTTTACCCGGAAGAGCG',
'Read6': 'TGACAGTAGATCTCGTCCAGACCCCTAGCTGGTACGTCTTCAGTAGAAAATTGTTTTTTTCTTCCAAGAGGTCGGAGT'}
print(statistics.mean(len(v) for v in read_map.values()))
output
55.333333333333336
def mean_length(read_map):
    total_chars = 0
    for key in read_map.values():
        total_chars = total_chars + len(key)
    result = total_chars / len(read_map)
    return result
I think this is the most intuitive code for a beginner.
I am trying to get the proportion of nouns in my text using the code below, and it is giving me an error. I am using a function that calculates the number of nouns in my text, and I have the overall word count in a different column.
pos_family = {
    'noun': ['NN', 'NNS', 'NNP', 'NNPS']
}

def check_pos_tag(x, flag):
    cnt = 0
    try:
        for tag, value in x.items():
            if tag in pos_family[flag]:
                cnt += value
    except:
        pass
    return cnt
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')/df2['word_count'])
Note: I have used nltk package to get the counts by PoS tags and I have the counts in a dictionary in PoS_Count column in my dataframe.
If I remove "/df2['word_count']" on the first run to get just the noun count, then add it back and run again, it works fine; but if I run it with the division on the first run, I get the error below.
ValueError: Wrong number of items passed 100, placement implies 1
Any help is greatly appreciated
Thanks in Advance!
As you have guessed, the problem is in the /df2['word_count'] bit.
df2['word_count'] is a pandas series, but you need to use a float or int here, because you are dividing check_pos_tag(x, 'noun') (which is an int) by it.
A possible solution is to extract the corresponding field from the series and use it in your lambda.
However, it would be easier (and arguably faster) to do each operation alone.
Try this:
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')) / df2['word_count']
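To make the behaviour concrete, here is a small self-contained sketch with a made-up miniature df2 (the column names mirror the question; the data and the simplified check_pos_tag are mine):

import pandas as pd

pos_family = {'noun': ['NN', 'NNS', 'NNP', 'NNPS']}

def check_pos_tag(x, flag):
    # compact variant of the question's function: sum the counts of the tags
    # that belong to the requested family
    return sum(value for tag, value in x.items() if tag in pos_family[flag])

# hypothetical miniature df2: PoS_Count holds a tag-count dict per row
df2 = pd.DataFrame({
    'PoS_Count': [{'NN': 3, 'VB': 2}, {'NNS': 1, 'JJ': 4}],
    'word_count': [10, 20],
})

# apply() yields a Series of per-row noun counts; dividing two Series of the
# same length is element-wise, so every row gets its own proportion
df2['noun_count'] = df2['PoS_Count'].apply(lambda x: check_pos_tag(x, 'noun')) / df2['word_count']
print(df2['noun_count'].tolist())   # [0.3, 0.05]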
How do you get the very next list within a nested list in python?
I have a few lists:
charLimit = [101100,114502,124602]
conditionalNextQ = [101101, 101200, 114503, 114504, 124603, 124604]
response = [[100100,4]
,[100300,99]
,[1100500,6]
,[1100501,4]
,[100700,12]
,[100800,67]
,[100100,64]
,[100300,26]
,[100500,2]
,[100501,35]
,[100700,9]
,[100800,8]
,[101100,"hello"]
,[101101,"twenty"] ... ]
for question in charLimit:
    for limitQuestion in response:
        limitNumber = limitQuestion[0]
        if question == limitNumber:
            print(limitQuestion)
The above code is doing what I want, i.e. printing the list instances in response when they contain one of the numbers in charLimit. However, I also want it to print the immediately following value in response as well.
For example, the second-to-last value in response contains 101100 (a value that's in charLimit), so I want it to not only print
101100,"hello"
(as the code does at the moment)
but the very next list also (and only the next)
101100,"hello"
101101,"twenty"
Thanks in advance for any help here. Please note that response is a very long list, so I'm looking to make things fairly efficient if possible, although it's not crucial in the context of this work. I'm probably missing something very simple, but I can't find examples of anyone doing this without using specific indexes in very small lists.
You can use enumerate to keep track of each element's index, so the next element can be looked up directly.
Ex:
charLimit = [101100,114502,124602]
conditionalNextQ = [101101, 101200, 114503, 114504, 124603, 124604]
response = [[100100,4]
,[100300,99]
,[1100500,6]
,[1100501,4]
,[100700,12]
,[100800,67]
,[100100,64]
,[100300,26]
,[100500,2]
,[100501,35]
,[100700,9]
,[100800,8]
,[101100,"hello"]
,[101101,"twenty"]]
l = len(response) - 1
for question in charLimit:
    for i, limitQuestion in enumerate(response):
        limitNumber = limitQuestion[0]
        if question == limitNumber:
            print(limitQuestion)
            if (i+1) <= l:
                print(response[i+1])
Output:
[101100, 'hello']
[101101, 'twenty']
I would eliminate the loop over charLimit and loop over response instead. Using enumerate in this loop allows us to access the next element by index, in the case that we want to print it:
for i, limitQuestion in enumerate(response, 1):
    limitNumber = limitQuestion[0]
    # use the `in` operator to check if `limitNumber` equals any
    # of the numbers in `charLimit`
    if limitNumber in charLimit:
        print(limitQuestion)
        # if this isn't the last element in the list, also
        # print the next one
        if i < len(response):
            print(response[i])
If charLimit is very long, you should consider defining it as a set instead, because sets have faster membership tests than lists:
charLimit = {101100,114502,124602}
I want to make a list of elements where each element starts with 4 numbers and ends with 4 letters, covering every possible combination. This is my code:
import itertools

def char_range(c1, c2):
    """Generates the characters from `c1` to `c2`"""
    for c in range(ord(c1), ord(c2)+1):
        yield chr(c)

chars = list()
nums = list()
for combination in itertools.product(char_range('a','b'), repeat=4):
    chars.append(''.join(map(str, combination)))
for combination in itertools.product(range(10), repeat=4):
    nums.append(''.join(map(str, combination)))
c = [str(x)+y for x, y in itertools.product(nums, chars)]
for dd in c:
    print(dd)
This runs fine, but when I use a bigger range of characters, such as a-z, the program hogs the CPU and memory and the PC becomes unresponsive. So how can I do this in a more efficient way?
The documentation of itertools says that "it is roughly equivalent to nested for-loops in a generator expression". So itertools.product is never an enemy of memory, but if you store its results in a list, that list is. Therefore:
for element in itertools.product(...):
    print(element)
is okay, but
myList = [element for element in itertools.product(...)]
or the equivalent loop of
for element in itertools.product(...):
    myList.append(element)
is not! So you want itertools to generate results for you, but you don't want to store them, rather use them as they are generated. Think about this line of your code:
c = [str(x)+y for x,y in itertools.product(nums,chars)]
Given that nums and chars can be huge lists, building another gigantic list of all combinations on top of them is definitely going to choke your system.
Now, as mentioned in the comments, if you replace all the lists that are too fat to fit into the memory with generators (functions that just yield), memory is not going to be a concern anymore.
Here is my full code. I basically changed your lists chars and nums to generators, and got rid of the final list c.
import itertools

def char_range(c1, c2):
    """Generates the characters from `c1` to `c2`"""
    for c in range(ord(c1), ord(c2)+1):
        yield chr(c)

def char(a):
    for combination in itertools.product(char_range(str(a[0]), str(a[1])), repeat=4):
        yield ''.join(map(str, combination))

def num(n):
    for combination in itertools.product(range(n), repeat=4):
        yield ''.join(map(str, combination))

def final(one, two):
    for foo in char(one):
        for bar in num(two):
            print(str(bar) + str(foo))
Now let's ask what every combination of ['a','b'] and range(2) is:
final(['a','b'],2)
Produces this:
0000aaaa
0001aaaa
0010aaaa
0011aaaa
0100aaaa
0101aaaa
0110aaaa
0111aaaa
1000aaaa
1001aaaa
1010aaaa
1011aaaa
1100aaaa
1101aaaa
1110aaaa
1111aaaa
0000aaab
0001aaab
0010aaab
0011aaab
0100aaab
0101aaab
0110aaab
0111aaab
1000aaab
1001aaab
1010aaab
1011aaab
1100aaab
1101aaab
1110aaab
1111aaab
0000aaba
0001aaba
0010aaba
0011aaba
0100aaba
0101aaba
0110aaba
0111aaba
1000aaba
1001aaba
1010aaba
1011aaba
1100aaba
1101aaba
1110aaba
1111aaba
0000aabb
0001aabb
0010aabb
0011aabb
0100aabb
0101aabb
0110aabb
0111aabb
1000aabb
1001aabb
1010aabb
1011aabb
1100aabb
1101aabb
1110aabb
1111aabb
0000abaa
0001abaa
0010abaa
0011abaa
0100abaa
0101abaa
0110abaa
0111abaa
1000abaa
1001abaa
1010abaa
1011abaa
1100abaa
1101abaa
1110abaa
1111abaa
0000abab
0001abab
0010abab
0011abab
0100abab
0101abab
0110abab
0111abab
1000abab
1001abab
1010abab
1011abab
1100abab
1101abab
1110abab
1111abab
0000abba
0001abba
0010abba
0011abba
0100abba
0101abba
0110abba
0111abba
1000abba
1001abba
1010abba
1011abba
1100abba
1101abba
1110abba
1111abba
0000abbb
0001abbb
0010abbb
0011abbb
0100abbb
0101abbb
0110abbb
0111abbb
1000abbb
1001abbb
1010abbb
1011abbb
1100abbb
1101abbb
1110abbb
1111abbb
0000baaa
0001baaa
0010baaa
0011baaa
0100baaa
0101baaa
0110baaa
0111baaa
1000baaa
1001baaa
1010baaa
1011baaa
1100baaa
1101baaa
1110baaa
1111baaa
0000baab
0001baab
0010baab
0011baab
0100baab
0101baab
0110baab
0111baab
1000baab
1001baab
1010baab
1011baab
1100baab
1101baab
1110baab
1111baab
0000baba
0001baba
0010baba
0011baba
0100baba
0101baba
0110baba
0111baba
1000baba
1001baba
1010baba
1011baba
1100baba
1101baba
1110baba
1111baba
0000babb
0001babb
0010babb
0011babb
0100babb
0101babb
0110babb
0111babb
1000babb
1001babb
1010babb
1011babb
1100babb
1101babb
1110babb
1111babb
0000bbaa
0001bbaa
0010bbaa
0011bbaa
0100bbaa
0101bbaa
0110bbaa
0111bbaa
1000bbaa
1001bbaa
1010bbaa
1011bbaa
1100bbaa
1101bbaa
1110bbaa
1111bbaa
0000bbab
0001bbab
0010bbab
0011bbab
0100bbab
0101bbab
0110bbab
0111bbab
1000bbab
1001bbab
1010bbab
1011bbab
1100bbab
1101bbab
1110bbab
1111bbab
0000bbba
0001bbba
0010bbba
0011bbba
0100bbba
0101bbba
0110bbba
0111bbba
1000bbba
1001bbba
1010bbba
1011bbba
1100bbba
1101bbba
1110bbba
1111bbba
0000bbbb
0001bbbb
0010bbbb
0011bbbb
0100bbbb
0101bbbb
0110bbbb
0111bbbb
1000bbbb
1001bbbb
1010bbbb
1011bbbb
1100bbbb
1101bbbb
1110bbbb
1111bbbb
Which is the exact result you are looking for. Each element of this result is generated on the fly, hence never creates a memory problem. You can now try and see that much bigger operations such as final(['a','z'],10) are CPU-friendly.
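If you would rather consume the combinations lazily than print them, the same idea can be packed into a single generator function (a sketch; the function name, the defaults, and the islice demo are mine):

import itertools
import string

def combos(letters='ab', digits='01', width=4):
    # lazily yield every '<digits block><letters block>' string, one at a time
    for chars in itertools.product(letters, repeat=width):
        for nums in itertools.product(digits, repeat=width):
            yield ''.join(nums) + ''.join(chars)

# peek at the first few of the full a-z / 0-9 space without building any list
for s in itertools.islice(combos(string.ascii_lowercase, string.digits), 5):
    print(s)   # 0000aaaa, 0001aaaa, 0002aaaa, 0003aaaa, 0004aaaa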
I've got a question regarding Linear Searching in Python. Say I've got the base code of
for l in lines:
    for f in search_data:
        if my_search_function(l[1], [f[0], f[2]]):
            print "Found it!"
            break
in which we want to determine where in search_data the value stored in l[1] exists. Say my_search_function() looks like this:
def my_search_function(search_key, search_values):
    for s in search_values:
        if search_key in s:
            return True
    return False
Is there any way to increase the speed of processing? Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes. I've tried an outside-in approach, i.e.
for line in lines:
    negative_index = -1
    positive_index = 0
    middle_element = len(search_data) / 2 if len(search_data) % 2 == 0 else (len(search_data) - 1) / 2
    found = False
    while positive_index < middle_element:
        # print str(positive_index)+","+str(negative_index)
        if my_search_function(line[1], [search_data[positive_index][0], search_data[negative_index][0]]):
            print "Found it!"
            break
        positive_index = positive_index + 1
        negative_index = negative_index - 1
However, I'm not seeing any speed increase from this. Does anyone have a better approach? I'm looking to cut the processing time in half, as I'm working with large amounts of CSV and the processing time for one file is over 00:15, which is unacceptable as I'm processing batches of 30+ files.
Basically, the data I'm searching on is essentially SKUs. A value from lines[0] could be something like AS123JK, and a valid match for that value could be AS123. So a HashMap would not work here, unless there exists a way to do partial matches in a HashMap lookup that wouldn't require breaking the values down into ['AS123', 'AS123J', 'AS123JK'], which is not ideal in this scenario. Thanks!
Binary Search would not work in this case, as lines and search_data are multidimensional lists and I need to preserve the indexes.
Regardless, it may be worth your while to extract the strings (along with some reference to the original data structure) into a flat list, sort it, and perform fast binary searches on it with the help of the bisect module.
Or, instead of performing a large number of searches, also sort a combined list of all the search keys and traverse both lists in parallel, looking for matches (proceeding in a manner similar to the merge step in merge sort, without actually outputting a merged list).
Code to illustrate the second approach:
lines = ['AS12', 'AS123', 'AS123J', 'AS123JK', 'AS124']
search_keys = ['AS123', 'AS125']
try:
    iter_keys = iter(sorted(search_keys))
    key = next(iter_keys)
    for line in sorted(lines):
        if line.startswith(key):
            print('Line {} matches {}'.format(line, key))
        else:
            while key < line[:len(key)]:
                key = next(iter_keys)
except StopIteration:  # all keys processed
    pass
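For the first suggestion (a flat sorted list plus bisect), a minimal sketch could look like the one below; the toy data, the (row label, SKU) layout and the helper name are invented for illustration, and the original index is carried along so the matching rows can still be recovered:

import bisect

# flat, sorted list of (sku, original_index) pairs extracted from lines
lines = [('row0', 'AS123JK'), ('row1', 'AS124'), ('row2', 'BT77'), ('row3', 'AS123')]
skus = sorted((sku, i) for i, (_, sku) in enumerate(lines))
keys_only = [s for s, _ in skus]            # parallel list of plain strings for bisect

def matches_for(prefix):
    # yield every (sku, original_index) whose SKU starts with the given prefix
    start = bisect.bisect_left(keys_only, prefix)
    for pos in range(start, len(skus)):
        sku, idx = skus[pos]
        if not sku.startswith(prefix):
            break
        yield sku, idx

print(list(matches_for('AS123')))   # [('AS123', 3), ('AS123JK', 0)]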
It depends on the details of the problem.
For instance, if you search for complete words, you could create a hash table on the searchable elements, and the final search would be a simple lookup.
Filling the hash table is pseudo-linear.
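A minimal sketch of that hash-table idea, assuming exact (complete-word) matches are enough; the shapes of search_data and lines here are invented:

# build the table once: value -> list of row indices in search_data containing it
search_data = [['AS123', 'x'], ['BT77', 'y']]
lines = [['row0', 'AS123'], ['row1', 'CZ9'], ['row2', 'BT77']]

lookup = {}
for row_index, row in enumerate(search_data):
    lookup.setdefault(row[0], []).append(row_index)

# each query is then a single dictionary lookup instead of a scan
for line in lines:
    hits = lookup.get(line[1])
    if hits:
        print('Found it!', line, '->', hits)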
Ultimately, I broke down and implemented Binary Search on my multidimensional lists by sorting with the sorted() function, using a lambda as the key argument. Here is the first-pass code that I whipped up. It's not 100% efficient, but it's a vast improvement over where we were.
def binary_search(master_row, source_data, master_search_index, source_search_index):
    lower_bound = 0
    upper_bound = len(source_data) - 1
    found = False
    while lower_bound <= upper_bound and not found:
        middle_pos = (lower_bound + upper_bound) // 2
        if source_data[middle_pos][source_search_index] < master_row[master_search_index]:
            if search([source_data[middle_pos][source_search_index]], [master_row[master_search_index]]):
                return {"result": True, "index": middle_pos}
            lower_bound = middle_pos + 1
        elif source_data[middle_pos][source_search_index] > master_row[master_search_index]:
            if search([master_row[master_search_index]], [source_data[middle_pos][source_search_index]]):
                return {"result": True, "index": middle_pos}
            upper_bound = middle_pos - 1
        else:
            if len(source_data[middle_pos][source_search_index]) > 5:
                return {"result": True, "index": middle_pos}
            else:
                break
    return {"result": False, "index": None}  # no match found
And here is where we actually make the binary search call:
# where master_copy is the first multidimensional list, data_copy is the second
# the search columns are the columns we want to search against
for line in master_copy:
    for m in master_search_columns:
        found = False
        for d in data_search_columns:
            data_copy = sorted(data_copy, key=lambda x: x[d], reverse=False)
            results = binary_search(line, data_copy, m, d)
            found = results["result"]
            if found:
                line = update_row(line, data_copy[results["index"]], column_mapping)
                found_count = found_count + 1
                break
        if found:
            break
Here's the info for sorting a multidimensional list: Python Sort Multidimensional Array Based on 2nd Element of Subarray.