I want to calculate the largest covering of a string from many sets of substrings.
All strings in this problem are lowercased, and contain no whitespace or unicode strangeness.
So, given a string: abcdef, and two groups of strings: ['abc', 'bc'], ['abc', 'd'], the second group (['abc', 'd']) covers more of the original string. Terms must match as exact, in-order substrings, so the term group ['fe', 'cba'] would not match the original string.
I have a large collection of strings, and a large collection of terms-groups. So I would like a bit faster implementation if possible.
I've tried the following in Python as an example. I used Pandas and NumPy because I thought they might speed things up a bit. I'm also running into an over-counting problem, as you'll see below.
import re
import pandas as pd
import numpy as np
my_strings = pd.Series(['foobar', 'foofoobar0', 'apple'])
term_sets = pd.Series([['foo', 'ba'], ['foo', 'of'], ['app', 'ppl'], ['apple'], ['zzz', 'zzapp']])
# For each string, calculate best proportion of coverage:
# Try 1: Create a function for each string.
def calc_coverage(mystr, term_sets):
    # Total length of string
    total_chars = len(mystr)
    # For each term set, sum up length of any match. Problem: this over counts when matches overlap.
    total_coverage = term_sets.apply(lambda x: np.sum([len(term) if re.search(term, mystr) else 0 for term in x]))
    # Fraction of String covered. Note the above over-counting can result in fractions > 1.0.
    coverage_proportion = total_coverage/total_chars
    return coverage_proportion.argmax(), coverage_proportion.max()

my_strings.apply(lambda x: calc_coverage(x, term_sets))
This results in:
0 (0, 0.8333333333333334)
1 (0, 0.5)
2 (2, 1.2)
Which presents some problems. The biggest problem I see is that over-lapping terms are being counted up separately, which results in the 1.2 or 120% coverage.
I think the ideal output would be:
0 (0, 0.8333333333333334)
1 (0, 0.8)
2 (3, 1.0)
I think I can write a double for loop and brute force it. But this problem feels like there's a more optimal solution. Or a small change on what I've done so far to get it to work.
Note: If there is a tie- returning the first is fine. I'm not too interested in returning all best matches.
OK, this is not optimized, but let's start by fixing the results. I believe you have two issues: over-counting in apple and under-counting in foofoobar0.
Solving the second issue when the term set is composed of two non-overlapping terms (or just one term) is easy:
sum([s.count(t)*len(t) for t in ts])
will do the job.
Similarly, when we have two overlapping terms, we will just take the "best" one:
max([s.count(t)*len(t) for t in ts])
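A quick check of both expressions on one of the sample strings (values worked out by hand):

s = 'foofoobar0'
ts = ['foo', 'ba']
sum([s.count(t)*len(t) for t in ts])   # 2*3 + 1*2 = 8
max([s.count(t)*len(t) for t in ts])   # 6 -- only used when the two terms overlap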
So we are left with the problem of recognizing when the two terms overlap. I don't even consider term sets with more than two terms, because the solution will already be painfully slow with two :(
Let's define a function to test for overlapping:
def terms_overlap(s, ts):
    if ts[0] not in s or ts[1] not in s:
        return False
    # check every occurrence of ts[0] for an occurrence of ts[1] starting inside it
    start = 0
    while (pos_0 := s.find(ts[0], start)) > -1:
        if (pos_1 := s.find(ts[1], pos_0)) > -1:
            if pos_0 <= pos_1 < (pos_0 + len(ts[0]) - 1):
                return True
        start = pos_0 + len(ts[0])
    # and the other way around
    start = 0
    while (pos_1 := s.find(ts[1], start)) > -1:
        if (pos_0 := s.find(ts[0], pos_1)) > -1:
            if pos_1 <= pos_0 < (pos_1 + len(ts[1]) - 1):
                return True
        start = pos_1 + len(ts[1])
    return False
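With the sample data this gives, for example:

>>> terms_overlap('apple', ['app', 'ppl'])
True
>>> terms_overlap('foobar', ['foo', 'ba'])
False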
With that function ready we can finally do:
def calc_coverage(strings, tsets):
    for xs, s in enumerate(strings):
        best_cover = 0
        best_ts = 0
        for xts, ts in enumerate(tsets):
            if len(ts) == 1:
                cover = s.count(ts[0])*len(ts[0])
            elif len(ts) == 2:
                if terms_overlap(s, ts):
                    cover = max([s.count(t)*len(t) for t in ts])
                else:
                    cover = sum([s.count(t)*len(t) for t in ts])
            else:
                raise ValueError('Cannot handle term sets of more than two terms')
            if cover > best_cover:
                best_cover = cover
                best_ts = xts
        print(f'{xs}: {s:15} {best_cover:2d} / {len(s):2d} = {best_cover/len(s):8.3f} ({best_ts}: {tsets[best_ts]})')
>>> calc_coverage(my_strings, term_sets)
0: foobar 5 / 6 = 0.833 (0: ['foo', 'ba'])
1: foofoobar0 8 / 10 = 0.800 (0: ['foo', 'ba'])
2: apple 5 / 5 = 1.000 (3: ['apple'])
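The approach above caps term sets at two terms. If they may grow larger, a different route (a sketch, not part of the solution above) is to mark which character positions each match covers and take the size of the union. For overlapping pairs it counts the characters actually covered rather than falling back to the longer term alone, but on the sample data it reproduces the expected 0.833 / 0.8 / 1.0.

def coverage_by_positions(s, ts):
    # mark every character index covered by any occurrence of any term
    covered = set()
    for t in ts:
        start = 0
        while (pos := s.find(t, start)) > -1:
            covered.update(range(pos, pos + len(t)))
            start = pos + 1
    return len(covered) / len(s)

Quick check against the expected output above:

>>> coverage_by_positions('foofoobar0', ['foo', 'ba'])
0.8
>>> coverage_by_positions('apple', ['apple'])
1.0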
I have a dataframe with 3 million rows. I need to transform the values in a column. The column contains strings joined together with ";". The transformation involves breaking up the string into its components and then choosing one of the strings based on some priority rules.
Here is the sample dataset and the function:
import pandas as pd

data = {'Name': ['X1', 'X2', 'X3', 'X4', 'X5', 'X6'],
        'category': ['CatA;CatB', 'CatB', None, 'CatB;CatC;CatA', 'CatA;CatB', 'CatB;CatD;CatB;CatC;CatA']}
sample_dataframe = pd.DataFrame(data)

def cat_name(x):
    if x:
        x = pd.Series(x.split(";"))
        y = x[(x!='CatA') & x.notna()]
        custom_dict = {'CatC': 0, 'CatD': 1, 'CatB': 2, 'CatE': 3}
        if x.count() == 1:
            return x.iloc[0]
        elif y.count() > 1:
            y = y.sort_values(key=lambda x: x.map(custom_dict))
            if y.count() > 2:
                return '3 or more'
            else:
                return y.iloc[0]+'+'
        elif y.count() == 1:
            return y.iloc[0]
        else:
            return None
    else:
        return None
I am using the apply method test_data = sample_dataframe['category'].apply(cat_name) to run the function on the column. For my dataset of 3 million rows, the function takes almost 10 minutes to run.
How can I optimize the function to run faster?
Also, I have two sets of category rules, and the output category needs to be stored in two columns. Currently I am using the apply function twice. Kinda dumb and slow, I know, but it works.
Is there a way to run the function at the same time for a different priority dictionary and return two output values? I tried to use
test_data['CAT_NAME'], test_data['MAIN_CAT_NAME']=zip(*sample_dataframe['category'].apply(joint_cat_name)) with the function
def joint_cat_name(x):
    cat_string = x
    if cat_string:
        string_series = pd.Series(cat_string.split(";"))
        y = string_series[(string_series!='CatA') & string_series.notna()]
        custom_dict = {'CatB': 0, 'CatC': 1, 'CatD': 2, 'CatE': 3}
        if string_series.count() == 1:
            return string_series.iloc[0], string_series.iloc[0]
        elif y.count() > 1:
            y = y.sort_values(key=lambda x: x.map(custom_dict))
            if y.count() > 2:
                return '3 or more', y.iloc[0]
            elif y.count() == 1:
                return y.iloc[0]+'+', y.iloc[0]
        elif y.count() == 1:
            return y.iloc[0], y.iloc[0]
        else:
            return None, None
    else:
        return None, None
But I got an error TypeError: 'NoneType' object is not iterable when the zip function encountered a tuple containing Nones, i.e. it threw an error when the output was (None, None).
Thanks a lot in advance.
Your function does a lot of unnecessary work. Even if you just reorder some conditionals it will run much faster.
custom_dict = {"CatC": 0, "CatD": 1, "CatB": 2, "CatE": 3}

def cat_name(x):
    if x is None:
        return x
    xs = x.split(";")
    if len(xs) == 1:
        return xs[0]
    ys = [x for x in xs if x != "CatA"]
    l = len(ys)
    if l == 0:
        return None
    if l == 1:
        return ys[0]
    if l == 2:
        return min(ys, key=lambda k: custom_dict[k]) + "+"
    if l > 2:
        return "3 or more"
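Applying it to the sample column should give the same results as your original function (worked out by hand, so worth double-checking on your data):

>>> list(sample_dataframe['category'].apply(cat_name))
['CatB', 'CatB', None, 'CatC+', 'CatB', '3 or more']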
Rather than running one Python function on each row, it might be faster to go through your dataframe multiple times and each time use an optimized Pandas query. You'd have to rewrite your code something like this:
# select empty categories
no_cat = sample_dataframe['category'].isna()
# select category strings with only one category
single_cat = ~no_cat & (sample_dataframe['category'].str.count(";") == 0)
# get number of categories
num_cats = sample_dataframe['category'].str.count(";") + 1
three_or_more = num_cats > 2
# has a "CatA" category
has_cat_A = sample_dataframe['category'].str.contains("CatA", na=False)

# then also write these selected rows in a custom way
sample_dataframe["cat_name"] = ""
sample_dataframe.loc[no_cat, "cat_name"] = None
sample_dataframe.loc[single_cat, "cat_name"] = sample_dataframe.loc[single_cat, "category"]
sample_dataframe.loc[three_or_more, "cat_name"] = "3 or more"

# continue with however complex you want to get to cover more cases, e.g.
two_cats_no_cat_A = (num_cats == 2) & ~has_cat_A

# then handle only the remaining cases with the apply
not_handled = ~no_cat & ~single_cat & ~three_or_more
sample_dataframe.loc[not_handled, "cat_name"] = sample_dataframe.loc[not_handled, "category"].apply(cat_name)
Running these queries on 3 million rows should be plenty fast, even if you have to do a few of them and combine them. If it's still too slow, you can handle more special cases from the apply in the same vectorized fashion.
This could be a foolish question; maybe my thought process is totally wrong (if so, please point it out), but how do you extract the three incremented variables (c_char, c_word, c_sentence) from inside a custom function and use them elsewhere?
def length_finder(x):
    # variables counting characters, words, sentences
    c_char = 0
    c_word = 1
    c_sentence = 0
    for i in x:
        if (i >= 'a' and i <= 'z') or (i >= 'A' and i <= 'Z'):
            c_char += 1
        if i == " ":
            c_word += 1
        if i == '.' or i == '!' or i == '?':
            c_sentence += 1
length_finder(input("Enter the text you wish to analyze: "))
L = 100/c_word*c_char
S = 100/c_word*c_sentence
#formula to get readability
index = 0.0588 * L - 0.296 * S - 15.8
print("This text is suitable for grade " + str(index))
You can return multiple variables from within the function:
def length_finder(x):
    ...
    return (c_char, c_word, c_sentence)
(c_char, c_word, c_sentence) = length_finder('input string')
You have two options.
Make the three variables global and declare them as global inside the function (see the sketch below).
Return them from the function; you can return all three values as a tuple, dictionary, or list.
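A minimal sketch of the global-variable option (the counting logic is condensed here, and the sample sentence is just for illustration; returning the values is usually the cleaner choice):

c_char = c_word = c_sentence = 0

def length_finder(x):
    global c_char, c_word, c_sentence   # rebind the module-level names
    c_char = sum(ch.isalpha() for ch in x)
    c_word = x.count(' ') + 1
    c_sentence = sum(ch in '.!?' for ch in x)

length_finder("Hello world. How are you?")
print(c_char, c_word, c_sentence)   # 19 5 2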
I am trying to do the following:
1) count how many numbers from the data list fall into each range, e.g. there are three numbers between 10 and 20 inclusive.
2) represent the count for each number range with the same number of '#' characters, e.g. 3 numbers between 10 and 20 = ###.
Ideally ending in having the two values represented next to each other.
Unfortunately I really can't figure out step two and any help would really be appreciated.
My code is below:
def count_range_in_list(li, min, max):
    ctr = 0
    for x in li:
        if min <= x <= max:
            ctr += 1
    return ctr
def amountOfHashes(count_range_in_list,ctr):
    ctr = count_range_in_list()
    if ctr == 1:
        print ('#')
    elif ctr == 2:
        print ('##')
    elif ctr == 3:
        print ('###')
    elif ctr == 4:
        print ('####')
    elif ctr == 5:
        print ('#####')
    elif ctr == 6:
        print ('######')
    elif ctr == 7:
        print ('#######')
    elif ctr == 8:
        print ('########')
    elif ctr == 9:
        print ('#########')
    elif ctr == 10:
        print ('##########')
data = [90,30,13,67,85,87,50,45,51,72,64,69,59,17,22,23,44,25,16,67,85,87,50,45,51]
print(count_range_in_list(data, 0, 10),amountOfHashes)
print(count_range_in_list(data, 10, 20),amountOfHashes)
print(count_range_in_list(data, 20, 30),amountOfHashes)
print(count_range_in_list(data, 30, 40),amountOfHashes)
print(count_range_in_list(data, 40, 50),amountOfHashes)
print(count_range_in_list(data, 50, 60),amountOfHashes)
print(count_range_in_list(data, 60, 70),amountOfHashes)
print(count_range_in_list(data, 70, 80),amountOfHashes)
print(count_range_in_list(data, 80, 90),amountOfHashes)
print(count_range_in_list(data, 90, 100),amountOfHashes)
I'll start by clearing out some doubts you seem to have.
First, how to use the value of a function inside another one:
You don't need to pass the reference of a method to another here. What I mean is, in amountOfHashes(count_range_in_list,ctr) you can just drop count_range_in_list as a parameter, and just define it like amountOfHashes(ctr). Or better yet, use snake case in the method name instead of camel case, so you end up with amount_of_hashes(ctr). Even if you had to execute count_range_in_list inside amount_of_hashes, Python is smart enough to let you do that without having to pass the function reference, since both methods are inside the same file already.
And why do you only need ctr? Well, count_range_in_list already returns a counter, so that's all we need. One parameter, named ctr. In doing so, to "use the result from a function in a new one", we could:
def amount_of_hashes(ctr):
    ...

# now, passing the value of count_range_in_list in amount_of_hashes
amount_of_hashes(count_range_in_list(data, 10, 20))
You've figured out step 1) quite well already, so we can go to step 2) right away.
In Python it's good to think of iterative processes such as yours dynamically rather than in hard coded ways. That is, creating methods to check the same condition with a tiny difference between them, such as the ones in amountOfHashes, can be avoided in this fashion:
# Method name changed for preference. Use the name that best fits you
def counter_hashes(ctr):
    # A '#' for each item in a range with the length of our counter
    if ctr == 0:
        return 'N/A'
    return ''.join(['#' for each in range(ctr)])
But as noted by Roland Smith, you can take a string and multiply it by a number - that'll do exactly what you think: repeat the string multiple times.
>>> 3*'#'
'###'
So you don't even need my counter_hashes above, you can just ctr*'#' and that's it. But for consistency, I'll change counter_hashes with this new finding:
def counter_hashes(ctr):
    # will still return 'N/A' when ctr = 0
    return ctr*'#' or 'N/A'
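For example:

>>> counter_hashes(3)
'###'
>>> counter_hashes(0)
'N/A'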
For organization purposes, since you have a specific need (printing the hashes and the hash count), you may want to format what goes into print. You could make a specific method for the printing that calls both counter_hashes and count_range_in_list and gives you a cleaner result:
def hash_range(data, min, max):
    ctr = count_range_in_list(data, min, max)
    hashes = counter_hashes(ctr)
    print(f'{hashes} | {ctr} items in range')
The use and output of this would then become:
>>> data = [90,30,13,67,85,87,50,45,51,72,64,69,59,17,22,23,44,25,16,67,85,87,50,45,51]
>>> hash_range(data, 0, 10)
N/A | 0 items in range
>>> hash_range(data, 10, 20)
### | 3 items in range
>>> hash_range(data, 20, 30)
#### | 4 items in range
And so on. If you just want to print things right away, without the hash_range method above, it's simpler, but more lengthy and repetitive:
>>> ctr = count_range_in_list(data, 10, 20)
>>> print(counter_hashes(ctr), ctr)
### 3
Why not just do it like this:
Python 3.x:
def amount_of_hashes(ctr):
    while ctr > 0:
        print('#', end='')
        ctr = ctr - 1
Python 2.x:
def amount_of_hashes(ctr):
    while ctr > 0:
        print '#',   # note: in Python 2 the trailing comma inserts a space between the hashes
        ctr = ctr - 1
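Combined with your count_range_in_list and the sample data, the Python 3 version would be used like this:

amount_of_hashes(count_range_in_list(data, 10, 20))   # prints: ###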
Counting the number in a list can be done like this:
def count_range_in_list(li, mini, maxi):
    return len([i for i in li if mini <= i <= maxi])
Then making a number of hashes is even simpler. Just multiply a string containing the hash sign with a number.
print(count_range_in_list(data, 0, 10)*'#')
Example in IPython:
In [1]: data = [90,30,13,67,85,87,50,45,51,72,64,69,59,17,22,23,44,25,16,67,85,87,50,45,51]
In [2]: def count_range_in_list(li, mini, maxi):
...:     return len([i for i in li if mini <= i <= maxi])
...:
In [3]: print(count_range_in_list(data, 0, 10)*'#')
In [4]: print(count_range_in_list(data, 10, 20)*'#')
###
In [5]: print(count_range_in_list(data, 20, 30)*'#')
####
There are many ways to do this. One way is to use a for loop with range:
# Most basic
def count_range_in_list(li, min, max):
    ctr = 0
    hashes = ""
    for x in li:
        if min <= x <= max:
            ctr += 1
            hashes += "#"
    print("There are {0} numbers = {1}".format(ctr, hashes))

# more declarative
def count_range_in_list(li, min, max):
    nums = [x for x in li if min <= x <= max]
    hashes = "".join(["#" for n in nums])
    print("There are {0} numbers = {1}".format(len(nums), hashes))
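Either version prints its result directly, e.g. with the sample data:

>>> count_range_in_list(data, 10, 20)
There are 3 numbers = ###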
I am trying to figure out how to take in a list of numbers and sort them into categories such as 0-10, 10-20, 20-30, and so on up to 90-100. I have the code started, but it isn't reading in all the inputs; it only handles the last one and repeats it. I am stumped, can anyone help please?
def eScores(Scores):
    count0 = 0
    count10 = 0
    count20 = 0
    count30 = 0
    count40 = 0
    count50 = 0
    count60 = 0
    count70 = 0
    count80 = 0
    count90 = 0
    if Scores > 90:
        count90 = count90 + 1
    if Scores > 80:
        count80 = count80 + 1
    if Scores > 70:
        count70 = count70 + 1
    if Scores > 60:
        count60 = count60 + 1
    if Scores > 50:
        count50 = count50 + 1
    if Scores > 40:
        count40 = count40 + 1
    if Scores > 30:
        count30 = count30 + 1
    if Scores > 20:
        count20 = count20 + 1
    if Scores > 10:
        count10 = count10 + 1
    if Scores <= 10:
        count0 = count0 + 1
    print count90,'had a score of (90 - 100]'
    print count80,'had a score of (80 - 90]'
    print count70,'had a score of (70 - 80]'
    print count60,'had a score of (60 - 70]'
    print count50,'had a score of (50 - 60]'
    print count40,'had a score of (40 - 50]'
    print count30,'had a score of (30 - 40]'
    print count20,'had a score of (20 - 30]'
    print count10,'had a score of (10 - 20]'
    print count0,'had a score of (0 - 10]'
    return eScores(Scores)
Each time eScores is called, it sets all the counters (count10, count20, ...) back to zero, so only the final call has any effect.
You should either declare the counters as global variables, or put the function into a class and make the counters member variables of the class.
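A minimal sketch of the class option (ScoreCounter is an illustrative name, not something from your code):

class ScoreCounter:
    def __init__(self):
        # one counter per 10-point bucket: 0, 10, ..., 90
        self.counts = {bucket: 0 for bucket in range(0, 100, 10)}

    def add(self, score):
        # floor bucketing: 93 -> 90, 67 -> 60; note this puts exactly 90 into
        # the 90 bucket, unlike the strict > comparisons in the original
        bucket = min(int(score) // 10 * 10, 90)
        self.counts[bucket] += 1

counter = ScoreCounter()
for score in [93, 67, 67, 8]:
    counter.add(score)
print(counter.counts[90], 'had a score of (90 - 100]')   # 1 had a score of (90 - 100]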
Another problem is that the function calls itself in the return statement:
return eScores(Scores)
Since this function is (as I understand it) supposed to update the counter variables only, it does not need to return anything, let alone call itself recursively. You'd better remove the return statement.
One mistake you're making is that you're not breaking out of the whole set of ifs as you go through. For example, if your number is 93, it is going to set count90 to 1, then go on to count80 and set that to 1 as well, and so on all the way down to count10.
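Turning the chain into if/elif (a sketch of just the first few branches) makes each score land in exactly one bucket:

if Scores > 90:
    count90 = count90 + 1
elif Scores > 80:
    count80 = count80 + 1
elif Scores > 70:
    count70 = count70 + 1
# ... and so on, ending with an else for scores of 10 or less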
Your code is repeating because the function is infinitely recursive (it has no stop condition). Here are the relevant bits:
def eScores(Scores):
    # ...
    return eScores(Scores)
I think what you'd want is more like:
def eScores(Scores):
    # same as before, but change the last line:
    return
Since you're printing the results, I assume you don't want to return the values of count10, count20, etc.
Also, the function won't accumulate results since you're creating new local counts each time the function is called.
Why don't you just use each number as a key (after processing) and return a dictionary of values?
def eScores(Scores):
    return_dict = {}
    for score in Scores:
        keyval = int(score/10)*10  # py3k automatically does float division
        if keyval not in return_dict:
            return_dict[keyval] = 1
        else:
            return_dict[keyval] += 1
    return return_dict
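For example, on a few of the sample scores (dict keys appear in insertion order on Python 3.7+):

>>> eScores([90, 30, 13, 17, 16])
{90: 1, 30: 1, 10: 3}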