Here I am trying to determine my feature engineering computation time:

n_dicwords = [np.sum([c.lower().count(w.decode('utf-8')) for w in dictionary])
              for c in documents]
This line of code goes through every document and checks whether, and if so how many times, its words appear in the dictionary that I have, generating a feature called n_dicwords. Sorry, I am such a noob to complexity theory; I think the time complexity for generating this feature is O(n*m*w), where n is the number of documents, m is the number of words in each document and w is the number of words in the dictionary. Am I right? And if so, is there any way to improve this?
Thank you so much! I really appreciate your help!
Unless the code underneath your code does any clever stuff, your complexity analysis should be correct.
If performance in this part is important, you should use a multiple-pattern string search algorithm, which is designed to solve pretty much exactly the problem you are working on.
To start with, have a look at Aho-Corasick, which is the most commonly used one and runs in linear time. Googling "Aho-Corasick python" turned up a few different implementations, so while I have not used any of them personally, I would think you would not have to implement the algorithm itself to use it.
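For example, here is a rough sketch using the third-party pyahocorasick package (one of the implementations that search turns up; I'm assuming, as in the question's snippet, that dictionary holds UTF-8 bytes and documents holds strings):

import ahocorasick  # pip install pyahocorasick

automaton = ahocorasick.Automaton()
for w in dictionary:
    word = w.decode('utf-8') if isinstance(w, bytes) else w
    automaton.add_word(word, word)   # store each dictionary word as a pattern
automaton.make_automaton()

# One pass per document; this counts every (possibly overlapping) match, which is
# close to, but not identical to, what str.count does in the original snippet.
n_dicwords = [sum(1 for _ in automaton.iter(doc.lower())) for doc in documents]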
If you just need your code to run a little faster, and don't need the best performance you could possibly get, you could just use a set for the dictionary. In Python a normal set is a hash set, so it has constant-time lookup. Then, for each word, you can simply check whether it is in the dictionary.
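Here is a minimal sketch of that set-based approach (again assuming the dictionary holds UTF-8 bytes, as in the question). Note that it counts whole words after a simple split(), whereas the original str.count also matches substrings:

dictionary_set = {w.decode('utf-8') for w in dictionary}   # O(1) average lookup

n_dicwords = [sum(1 for word in doc.lower().split() if word in dictionary_set)
              for doc in documents]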
I'm slightly surprised to note that the "x in s" construction in Python is O(n), where n is the number of items in the list. So your estimation is correct. A slightly more accurate way of looking at it: since your document and word counts aren't changing at all, the important numbers are the total number of words that must be checked and the length of the dictionary against which they are checked. Obviously this doesn't change the number of computations at all; it just gets us to the quickly recognizable form O(m*n).
You could conceivably store your dictionary in a binary tree, which would reduce each lookup to O(log(n)).
Search for "binary tree python" on Google; I saw a few interesting things out there, like a package called "bintrees".
However, Erik Vesteraas points out that the Python 'set' data structure is a hash-based collection, with O(1) lookup in the average case and O(n) in the worst (and very rare) case.
See https://docs.python.org/2/library/stdtypes.html#set
I have a dataframe (more than 1 million rows) that has an open text column where customers can write whatever they want.
Misspelled words appear frequently and I'm trying to group comments that are grammatically the same.
For example:
ID | Comment
1  | I want to change my credit card
2  | I wannt change my creditt card
3  | I want change credit caurd
I have tried using Levenshtein Distance but computationally it is very expensive.
Can you tell me another way to do this task?
Thanks!
Levenshtein Distance has time complexity O(N^2).
If you define a maximum distance you're interested in, say m, you can reduce the time complexity to O(N*m). The maximum distance, in your context, is the maximum number of typos you accept while still considering two comments as identical.
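As an illustration (not the asker's code), here is a plain dynamic-programming Levenshtein that gives up early once every cell in the current row exceeds the cutoff m; a true O(N*m) version would additionally restrict the inner loop to a diagonal band of width 2m+1:

def levenshtein_within(a, b, m):
    """Return the edit distance between a and b, or None if it exceeds m."""
    if abs(len(a) - len(b)) > m:               # length gap alone exceeds the cutoff
        return None
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                  # deletion
                               current[j - 1] + 1,               # insertion
                               previous[j - 1] + (ca != cb)))    # substitution
        if min(current) > m:                   # every prefix already exceeds m
            return None
        previous = current
    return previous[-1] if previous[-1] <= m else None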
If you cannot do that, you may try to parallelize the task.
This is not a trivial task. If faced with this problem, my approach would be:
Tokenise your sentences. There are many ways to tokenise a sentence; the most straightforward is to convert a sentence to a list of words. E.g. I want to change my credit card becomes [I, want, to, change, my, credit, card]. Another way is to roll a window of size n across your sentence, e.g. I want to becomes ['I w', ' wa', 'wan', 'ant', ...] for window size 3.
After tokenising your sentences, create an embedding (vectorise), i.e. convert your tokens to vectors of numbers. The simplest way is to use a ready-made library like sklearn's TfidfVectorizer. If your data cares about the order of the words, then a more sophisticated vectoriser is needed.
After vectorising, use a clustering algorithm. The simplest one is K-Means.
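For instance, a minimal out-of-the-box sketch of these steps using sklearn; the column name "Comment" and the cluster count are assumptions you would need to adapt, and character n-grams are used because they are more forgiving of misspellings than word tokens:

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# df is the dataframe from the question, with an open-text "Comment" column.
comments = df["Comment"].fillna("").tolist()

# Tokenise + vectorise: TF-IDF over character trigrams (robust to typos).
vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 3))
X = vectorizer.fit_transform(comments)

# Cluster: the number of clusters is a guess and usually needs tuning.
df["cluster"] = KMeans(n_clusters=50, random_state=0).fit_predict(X)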
Of course, this is a very complicated task, and there are many ways to approach the problem. What I described is the simplest out-of-the-box solution. Some clever people have used different strategies to get better results. One example is https://www.youtube.com/watch?v=nlKE4gvJjMo. You will need to research this field on your own.
Edit: of course your approach is good for a small dataset. But the difficult part lies in how to perform better than O(n^2) complexity.
I was given a problem in which you are supposed to write a python code that distributes a number of different weights among 4 boxes.
Logically we can't expect a perfect distribution: if we are given weights like 10, 65, 30, 40, 50 and 60 kilograms, there is no way of grouping those numbers without making one box heavier than another. But we can aim for the most homogeneous distribution, e.g. ((60), (40, 30), (65), (50, 10)).
I can't even think of an algorithm to complete this task, let alone turn it into Python code. Any ideas about the subject would be appreciated.
The problem you're describing is similar to the "fair teams" problem, so I'd suggest looking there first.
Because a simple greedy algorithm where weights are added to the lightest box won't work, the most straightforward solution would be a brute force recursive backtracking algorithm that keeps track of the best solution it has found while iterating over all possible combinations.
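A rough sketch of that brute-force backtracking approach (illustrative only; it tries all 4^n assignments and keeps the one with the smallest spread between the heaviest and lightest box):

def distribute(weights, n_boxes=4):
    weights = sorted(weights, reverse=True)     # placing big items first helps a little
    best = {"spread": float("inf"), "boxes": None}
    boxes = [[] for _ in range(n_boxes)]
    totals = [0] * n_boxes

    def backtrack(i):
        if i == len(weights):
            spread = max(totals) - min(totals)
            if spread < best["spread"]:
                best["spread"] = spread
                best["boxes"] = [list(b) for b in boxes]
            return
        for b in range(n_boxes):
            boxes[b].append(weights[i])
            totals[b] += weights[i]
            backtrack(i + 1)
            totals[b] -= weights[i]
            boxes[b].pop()

    backtrack(0)
    return best["boxes"]

print(distribute([10, 65, 30, 40, 50, 60]))   # prints one of the most even groupings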
As stated in #j_random_hacker's response, this is not going to be something easily done. My best idea right now is to find some baseline. I describe a baseline as the object with the largest value, since it cannot be subdivided. Using that, you can start trying to match the rest of the data to that value, which would only take about three passes: the first and second create a list of every possible combination, and the third goes over that list and compares the different options by taking the average of each group and keeping the average closest to your baseline.
Using your example, 65 is the baseline, and since you cannot subdivide it, you know it has to be the minimum bound on your data grouping, so you would try to match all of the rest of the values to that. It won't be great, but it does give you something to start with.
As j_random_hacker notes, the partition problem is NP-complete. This problem is also NP-complete by a reduction from the 4-partition problem (the article also contains a link to a paper by Garey and Johnson that proves that 4-partition itself is NP-complete).
In particular, given a list to 4-partition, you could feed that list as input to a function that solves your box distribution problem. If each box ended up with the same weight, a 4-partition would exist; otherwise not.
Your best bet would be to create an exponential-time algorithm that uses backtracking to iterate over the 4^n possible assignments, because unless P = NP (highly unlikely), no polynomial-time algorithm exists for this problem.
I need to find all the days of the month where a certain activity occurs. The days when the activity occurs will be sequential. The sequence of days can range from one to the entire month, and the sequence will occur exactly one time per month.
To test whether or not the activity occurs on any given day is not an expensive calculation, but I thought I would use this problem to learn something new. Which algorithm minimizes the number of days I have to test?
You can't really do much better than iterating through the sequence to find the first match, then iterating until the first non match. You can use itertools to make it nice and readable:
import itertools
# Skip days until the first match, then take days while they still match.
days = itertools.takewhile(mytest,
                           itertools.dropwhile(lambda x: not mytest(x), mysequence))
I think the linear probe suggested by #isbadawi is the best way to find the beginning of the subsequence. This is because the subsequence could be very short and could be anywhere within the larger sequence.
However, once the beginning of the subsequence is found, we can use a binary search to find the end of it. That will require fewer tests than doing a second linear probe, so it's a better solution for you.
As others have pointed out, there is not much practical reason for doing this, for two reasons: your large sequence is quite short (only about 31 elements), and you still need to do at least one linear probe anyway, so the big-O runtime will still be linear in the length of the large sequence, even though we have reduced part of the algorithm from linear to logarithmic.
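Here is a rough sketch of that combination, a linear probe for the start followed by a binary search for the end; days and test are placeholders for your month and your activity check:

def find_activity_range(days, test):
    """Return (start, end) indices of the single contiguous run of active days,
    or None if no day matches."""
    start = next((i for i, d in enumerate(days) if test(d)), None)   # linear probe
    if start is None:
        return None
    # Binary search for the last active day: within [start, end] test is True,
    # after end it is False, so the predicate is monotone from start onwards.
    lo, hi = start, len(days) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if test(days[mid]):
            lo = mid
        else:
            hi = mid - 1
    return start, lo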
The best method depends a bit on your input data structure. If your input is a list of booleans, one per day of the month, you can find the run directly (note that find/rfind exist only on strings, so use index() for a list):

start = activity.index(True)                             # first day with activity
end = len(activity) - 1 - activity[::-1].index(True)     # last day with activity
I would like to compare different methods of finding roots of functions in Python (like Newton's method or other simple calculus-based methods). I don't think I will have too much trouble writing the algorithms.
What would be a good way to make the actual comparison? I read up a little bit about Big-O. Would this be the way to go?
The answer from #sarnold is right -- it doesn't make sense to do a Big-Oh analysis.
The principal differences between root finding algorithms are:
rate of convergence (number of iterations)
computational effort per iteration
what is required as input (e.g. do you need to know the first derivative, do you need to set lo/hi limits for bisection, etc.)
what functions it works well on (e.g. works fine on polynomials but fails on functions with poles)
what assumptions it makes about the function (e.g. a continuous first derivative, being analytic, etc.)
how simple the method is to implement
I think you will find that each of the methods has some good qualities, some bad qualities, and a set of situations where it is the most appropriate choice.
Big O notation is ideal for expressing the asymptotic behavior of algorithms as the inputs to the algorithms "increase". This is probably not a great measure for root finding algorithms.
Instead, I would think the number of iterations required to bring the actual error below some epsilon ε would be a better measure. Another measure would be the number of iterations required to bring the difference between successive iterations below some epsilon ε. (The difference between successive iterations is probably a better choice if you don't have exact root values at hand for your inputs. You would use a criterion such as successive differences to know when to terminate your root finders in practice, so you could or should use it here, too.)
While you can characterize the number of iterations required for different algorithms by the ratios between them (one algorithm may take roughly ten times more iterations to reach the same precision as another), there often isn't "growth" in the iterations as inputs change.
Of course, if your algorithms take more iterations with "larger" inputs, then Big O notation makes sense.
Big-O notation is designed to describe how an algorithm behaves in the limit, as n goes to infinity. This is much easier to work with in a theoretical study than in a practical experiment. I would pick things to study that you can easily measure and that people care about, such as accuracy and computer resources (time/memory) consumed.
When you write and run a computer program to compare two algorithms, you are performing a scientific experiment, just like somebody who measures the speed of light, or somebody who compares the death rates of smokers and non-smokers, and many of the same factors apply.
Try to choose an example problem or problems to solve that are representative, or at least interesting to you, because your results may not generalise to situations you have not actually tested. You may be able to increase the range of situations to which your results apply if you sample at random from a large set of possible problems and find that all your random samples behave in much the same way, or at least follow much the same trend. You can have unexpected results even when theoretical studies show that there should be a nice n log n trend, because theoretical studies rarely account for suddenly running out of cache or memory, or usually even for things like integer overflow.
Be alert for sources of error, and try to minimise them, or have them apply to the same extent to all the things you are comparing. Of course you want to use exactly the same input data for all of the algorithms you are testing. Make multiple runs of each algorithm, and check how variable the results are - perhaps a few runs are slower because the computer was doing something else at the time. Be aware that caching may make later runs of an algorithm faster, especially if you run them immediately after each other. Which time you want depends on what you decide you are measuring. If you have a lot of I/O to do, remember that modern operating systems and computers cache huge amounts of disk I/O in memory. I once ended up powering the computer off and on again after every run, as the only way I could find to be sure that the device I/O cache was flushed.
You can get wildly different answers for the same problem just by changing starting points. Pick an initial guess that's close to the root and Newton's method will give you a result that converges quadratically. Choose another in a different part of the problem space and the root finder will diverge wildly.
What does this say about the algorithm? Good or bad?
I would suggest you have a look at the following Python root-finding demo.
It is simple code, with some different methods and comparisons between them (in terms of the rate of convergence).
http://www.math-cs.gordon.edu/courses/mat342/python/findroot.py
I just finished a project comparing the bisection, Newton, and secant root-finding methods. Since this is a practical case, I don't think you need to use Big-O notation; Big-O notation is more suitable for an asymptotic view. What you can do is compare them in terms of:
Speed - for example, Newton's method is the fastest here when good conditions are met
Number of iterations - for example, bisection takes the most iterations here
Accuracy - how often it converges to the right root if there is more than one root, or whether it converges at all
Input - what information it needs to get started; for example, Newton's method needs an x0 near the root in order to converge, and it also needs the first derivative, which is not always easy to find
Other - rounding errors
For the sake of visualization, you can store the value of each iteration in arrays and plot them. Use a function whose roots you already know.
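A small sketch of that kind of comparison, using x**2 - 2 (whose root sqrt(2) is known) and recording the absolute error at each iteration for bisection and Newton's method; you could feed the stored errors to matplotlib instead of printing them:

import math

def f(x):
    return x * x - 2.0

def df(x):
    return 2.0 * x

TRUE_ROOT = math.sqrt(2.0)

def bisection_errors(lo, hi, n_iters):
    errors = []
    for _ in range(n_iters):
        mid = (lo + hi) / 2.0
        errors.append(abs(mid - TRUE_ROOT))
        if f(lo) * f(mid) <= 0:      # root lies in the lower half
            hi = mid
        else:
            lo = mid
    return errors

def newton_errors(x0, n_iters):
    x, errors = x0, []
    for _ in range(n_iters):
        x -= f(x) / df(x)
        errors.append(abs(x - TRUE_ROOT))
    return errors

for name, errs in [("bisection", bisection_errors(1.0, 2.0, 10)),
                   ("newton", newton_errors(1.5, 10))]:
    print(name, ["%.1e" % e for e in errs])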
Although this is a very old post, my 2 cents :)
Once you've decided which algorithmic method to use to compare them (your "evaluation protocol", so to speak), then you might be interested in ways to run your challengers on actual datasets.
This tutorial explains how to do it, based on an example (comparing polynomial fitting algorithms on several datasets).
(I'm the author, feel free to provide feedback on the github page!)
I'm not sure how exactly to word this question, so here's an example:
string1 = "THEQUICKBROWNFOX"
string2 = "KLJHQKJBKJBHJBJLSDFD"
I want a function that would score string1 higher than string2 and a million other gibberish strings. Note the lack of spaces, so this is a character-by-character function, not word-by-word.
In the 90s I wrote a trigram-scoring function in Delphi and populated it with trigrams from Huck Finn, and I'm considering porting the code to C or Python or kludging it into a stand-alone tool, but there must be more efficient ways by now. I'll be doing this millions of times, so speed is nice. I tried the Reverend.Thomas Beyse() python library and trained it with some all-caps-strings, but it seems to require spaces between words and thus returns a score of []. I found some Markov Chain libraries, but they also seemed to require spaces between words. Though from my understanding of them, I don't see why that should be the case...
Anyway, I do a lot of cryptanalysis, so in the future scoring functions that use spaces and punctuation would be helpful, but right now I need just ALLCAPITALLETTERS.
Thanks for the help!
I would start with a simple probability model for how likely each letter is, given the previous (possibly-null, at start-of-word) letter. You could build this based on a dictionary file. You could then expand this to use 2 or 3 previous letters as context to condition the probabilities if the initial model is not good enough. Then multiply all the probabilities to get a score for the word, and possibly take the Nth root (where N is the length of the string) if you want to normalize the results so you can compare words of different lengths.
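A minimal sketch of that idea with single-letter context (a bigram model with add-one smoothing); "corpus.txt" is a placeholder for whatever training text you use, e.g. a plain-text Huck Finn:

import math
from collections import Counter

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def train_bigram_model(corpus):
    # Keep only A-Z so the model matches ALLCAPITALLETTERS input.
    text = "".join(ch for ch in corpus.upper() if ch in ALPHABET)
    bigrams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    unigrams = Counter(text[:-1])       # counts of each letter as a bigram start
    return bigrams, unigrams

def score(s, bigrams, unigrams):
    # Average log P(next letter | previous letter); closer to zero = more English-like.
    pairs = [s[i:i + 2] for i in range(len(s) - 1)]
    total = sum(math.log((bigrams[p] + 1) / (unigrams[p[0]] + len(ALPHABET)))
                for p in pairs)
    return total / max(len(pairs), 1)

bigrams, unigrams = train_bigram_model(open("corpus.txt").read())
print(score("THEQUICKBROWNFOX", bigrams, unigrams))      # should score higher
print(score("KLJHQKJBKJBHJBJLSDFD", bigrams, unigrams))  # should score lower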
I don't see why a Markov chain couldn't be modified to work. I would create a text file dictionary of sorts, and read that in to initially populate the data structure. You would just be using a chain of n letters to predict the next letter, rather than n words to predict the next word. Then, rather than randomly generating a letter, you would likely want to pull out the probability of the next letter. For instance if you had the current chain of "TH" and the next letter was "E", you would go to your map, and see the probability that an "E" would follow "TH". Personally I would simply add up all of these probabilities while looping through the string, but how to exactly create a score from the probability is up to you. You could normalize it for string length, to let you compare short and long strings.
Now that I think about it, this method would favor strings with longer words, since a dictionary would not include phrases. Then again, you could populate the dictionary not only with single words, but with short phrases with the spaces removed as well. Then the scoring would reflect not only how English the separate words are, but how English a series of words is. It's not a perfect system, but it would provide consistent scoring.
I don't know how it works, but Mail::SpamAssassin::Plugin::TextCat analyzes email and guesses what language it is (with dozens of languages supported).
The Index of Coincidence might be of help here, see https://en.wikipedia.org/wiki/Index_of_coincidence.
For a start, just compute the difference between the IC of your string and the expected value of 1.73 for English (see the Wikipedia article above). For advanced usage you might want to calculate the expected value yourself using some example language corpus.
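A minimal sketch of that first step (the 1.73 figure above is for the normalized IC; note that very short strings give noisy values):

from collections import Counter

def index_of_coincidence(text, alphabet_size=26):
    # Normalized IC: roughly 1.73 for English text, roughly 1.00 for random letters.
    letters = [c for c in text.upper() if c.isalpha()]
    n = len(letters)
    if n < 2:
        return 0.0
    counts = Counter(letters)
    ic = sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))
    return alphabet_size * ic

print(index_of_coincidence("THEQUICKBROWNFOX"))
print(index_of_coincidence("KLJHQKJBKJBHJBJLSDFD"))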
I'm thinking that maybe you could apply some text-to-speech synthesis ideas here. In particular, if a speech synthesis program is able to produce a pronunciation for a word, then that can be considered "English."
The pre-processing step is called grapheme-to-phoneme conversion, and typically leads to probabilities of mapping strings to sounds.
Here's a paper that describes some approaches to this problem. (I don't claim this paper is authoritative; it was just a highly ranked search result, and I don't really have expertise in this area.)