Longest Increasing Subsequence code in O(N)? - python

Someone asked me a question
Find the longest alphabetically increasing or equal string
composed of those letters. Note that you are allowed to drop
unused characters.
So ghaaawxyzijbbbklccc returns aaabbbccc.
Is an O(n) solution possible?
and I implemented it [in Python]:
s = 'ghaaawxyzijbbbklccc'
lst = [[] for i in range(26)]
for ch in s:
    ml = 0
    for i in range(0, ord(ch) + 1 - ord('a')):
        if len(lst[i]) > len(lst[ml]):
            ml = i
    cpy = ''.join(lst[ml])
    lst[ord(ch) - ord('a')] = cpy + ch
ml = 0
for i in range(26):
    if len(lst[i]) > len(lst[ml]):
        ml = i
print lst[ml]
and the answer is 'aaabbbccc'.
I have tried this on some more examples and it works every time!
As far as I can tell, the complexity of this code is O(N).
Let's take an example: suppose I have the string 'zzzz'. The main loop runs 4 times and the inner loop runs 26 times per iteration, so in the worst case the code runs in O(26*N + 26), where the trailing +26 is the final pass that picks the longest list.
So is calling this O(N) acceptable?
Now my questions are:
Does my code really run in O(N)? (my code at ideone)
If it runs in O(N), why use the O(N²) DP approach? (code of DP)
Is it better than this code? (Friend's code)
What are the limitations of this code?

It's O(N)
'Why use the O(N²) DP?': You don't need to for this problem. Note, though, that you take advantage of the fact that your sequence tokens (letters) are finite - so you can set up a list to hold all the possible starting values (26), and you need only look for the longest member of that list, an O(1) operation. A more generalised solution for sequences with an arbitrary number of ordered tokens can be done in O(N log N).
Your friend's code is basically the same, except it maps the letters to numbers and its list for the 26 starting places holds 26 letter counts - neither of which is strictly necessary. Conceptually, though, it's the same thing: holding a list of lists.
"Better" is a matter of opinion. Although it has the same asymptotic complexity, the constant terms may be different, so one may execute faster than the other. Also, in terms of storage, one may use very slightly more memory than the other. With such low n - judging which is more readable may be more important than the absolute performance of either algorithm. I'm not going to make a judgement.
You might notice a slight difference where the "winning" sequence is a tie. For instance - on the test string edxeducation that you have there - your implementation returns ddin whereas your friend's returns ddio. Both seem valid to me - without a rule to break such ties.
The major limitation of this code is that it can only cope with sequences composed entirely of letters of a single case. You could extend it to cope with upper and lower case letters, either treating them the same, or using an ordering where all lower case letters are "less than" all upper case letters, or something similar. This is just extending the finite set of tokens that it can cope with.
To generalise this limitation - the code will only cope with finite sets of sequence tokens as noted in 2. above. Also - there is no error handling, so if you put in a string with, say, digits or punctuation, it will fail.

This is a variation of the Longest Increasing Subsequence.
The difference is that your elements are bounded, since they can only run from 'a' to 'z'. Your algorithm is indeed O(N). O(N log N) is possible, for example using the algorithm from the link above. The bound on the number of possible elements turns this into O(N).
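To illustrate the generalised O(N log N) approach mentioned above, here is a minimal sketch using bisect (patience-style tails plus parent links for reconstruction); bisect_right is used because equal letters are allowed. The function name is mine, not from the question.

from bisect import bisect_right

def longest_nondecreasing_subsequence(s):
    tails = []               # tails[k] = smallest possible tail of a subsequence of length k+1
    tail_idx = []            # index in s of each current tail
    prev = [None] * len(s)   # parent links for reconstruction
    for i, ch in enumerate(s):
        pos = bisect_right(tails, ch)   # bisect_right: equal elements are allowed
        if pos == len(tails):
            tails.append(ch)
            tail_idx.append(i)
        else:
            tails[pos] = ch
            tail_idx[pos] = i
        prev[i] = tail_idx[pos - 1] if pos > 0 else None
    # walk the parent links back from the end of the longest subsequence
    out, i = [], tail_idx[-1] if tail_idx else None
    while i is not None:
        out.append(s[i])
        i = prev[i]
    return ''.join(reversed(out))

print(longest_nondecreasing_subsequence('ghaaawxyzijbbbklccc'))  # aaabbbccc

With only 26 possible letters, the binary search collapses to the O(1) bucket lookup your code already does, which is why the specialised version runs in O(N).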


How can I mark an index as 'used' when iterating over a list?

I will iterate over a list of integers, nums, multiple times, and each time an integer has been 'used' for something (it doesn't matter what), I want to mark its index as used so that in future iterations I do not use that integer again.
Two questions:
My idea is to simply create a separate list marker = [1]*len(nums); each time I use a number in nums, I subtract 1 from the corresponding index in marker to keep track of which numbers in nums I have used.
My first question: is there a well-known, more efficient way to do this? I believe this approach makes the space complexity O(n).
My other idea is to replace each entry in nums, like this: nums = [1,2,3,4] -> nums = [(1,1),(2,1),(3,1),(4,1)], and each time I use an integer in nums, subtract 1 from the second element of the pair to mark that it has been used. My question is, am I right in understanding that this improves the space complexity relative to idea 1 above, making it O(1)?
For reference, I am solving the following question: https://leetcode.com/contest/weekly-contest-256/problems/minimum-number-of-work-sessions-to-finish-the-tasks/
Where each entry in tasks needs to be used once.
I don't think there is a way to do it in O(1) space. That said, using boolean values instead of integers, or using a set, would be a cleaner solution.
No, the space complexity is still O(n). Think about it like this. Let us assume n is the size of the list. In the first method that you mentioned, we are storing n 'stuff' separately. So, the space complexity is O(n). In the second method also, we are storing n 'stuff' separately. It's just that those n 'stuff' are being stored as part of the same array. So, the space complexity still remains the same which is O(n).
Firstly, in both cases the space complexity comes out to be O(n). This is because nums itself uses O(n) space whether or not you keep a separate list to track which elements have been used, so the space complexity cannot come down to O(1) either way.
However, here is a suggestion.
If you don't want to use an element again, why not just remove it from the list?
Or, if you don't want to disrupt the indexing, just change the number to a sentinel value such as -1.
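To make the suggestions above concrete, here is a minimal sketch of the three marking strategies discussed (boolean list, set of used indices, in-place sentinel); all of them still need at most O(n) extra space:

nums = [3, 1, 4, 1, 5]

# 1. parallel boolean list (clearer than a list of 1s you decrement)
used = [False] * len(nums)
used[2] = True                 # mark index 2 as used

# 2. a set of used indices (grows only as elements are actually used)
used_idx = set()
used_idx.add(2)
print(0 in used_idx)           # False: index 0 is still free

# 3. overwrite in place with a sentinel such as -1
#    (only safe if -1 can never be a real value in nums)
nums[2] = -1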

I know the length and restricted character set of some CRC32 hashes. Does this make it easier to reverse them?

We have a bunch of CRC32 hashes which would be really nice to know the input of. Some of them are short enough that brute-forcing them is feasible; others are not. The CRC32 algorithm being used is equal to the one in Python's binascii (its implementation is spelled out at https://rosettacode.org/wiki/CRC-32#Python).
I've read these pages:
http://www.danielvik.com/2010/10/calculating-reverse-crc.html
http://www.danielvik.com/2013/07/rewinding-crc-calculating-crc-backwards.html
...and it seems to me that there's something hidden somewhere that can reduce the amount of effort needed to reverse these things, but I just can't puzzle it out.
The main reason I think we can do better than full brute force is that we know two things about the hash inputs:
We know the length of every input. I believe this means that as soon as we find a guess of the correct length that rewinds to a CRC value of 0, it must be a match (this might not help much though). Maybe there's some other length-related property of the algorithm that could cut down on effort that I don't see.
We also know that every input has a restricted character set of [A-Za-z_0-9] (only letters, numbers, and underscores). In addition, we know that numbers are rare, and A-Z do not appear to ever be mixed with a-z, so we can often get away with just [a-z_] in 90% of cases.
Also, the majority of these are snake_case English words/phrases (e.g. "weight" or "turn_around"). So we can also filter out any guesses that contain "qxp" or such.
Since the above links discuss how you can add any 4 chars to an input and leave the hash unchanged, one idea I thought of is to brute force the last 4 chars (most of which are invalid because they're not in the restricted charset), filter out the ones that are clearly not right (because of illegal 3-letter English combos and such), and come up with a "short"list of potentially valid last 4 chars. Then we repeat until the whole thing has been whittled down, which should be faster than pure brute force. But I can't find the leap in logic to figure out how.
If I'm going off in the wrong direction to solve this, that would be good to know as well.
the majority of these are snake_case English words/phrases (e.g. "weight" or "turn_around") - these could be brute-forced using a dictionary (e.g. from this question) and existing utilities. Assuming the total number of English words is up to 1M, trying (1M)^2 CRC32 candidates looks feasible and quite fast.
Given a text file with all dictionary words, enumerating every word_word combination and comparing against the CRC hashes can be done with e.g. the Hashcat tool (following the instructions from here) as:
hashcat64.bin -m 11500 -a 1 -j '$_' hashes.txt dictionary.txt dictionary.txt
and just testing against each word in dictionary as:
hashcat64.bin -m 11500 -a 0 hashes.txt dictionary.txt
For phrases longer than 2 words, each phrase length is an individual case, since e.g. Hashcat has no option to permute 3 or more dictionaries (ref). For 3-word phrases you need to generate a file with the 2-word combinations first (e.g. as here, but in the form {}_{}), then combine it with the 1-word dictionary: hashcat64.bin -m 11500 -a 1 -j '$_' hashes.txt two_words_dictionary.txt dictionary.txt. Next, 4-word phrases could be brute-forced as hashcat64.bin -m 11500 -a 1 -j '$_' hashes.txt two_words_dictionary.txt two_words_dictionary.txt, and so on. (Another option would be to pipe combinations into Hashcat as combine_scrip ... | hashcat64.bin -m 11500 -a 0 hashes.txt, but as CRC32 is very fast to check, the pipe would be the bottleneck; using dictionary files is much faster than piping.)
Of course, n-word permutations increase the complexity exponentially in n (with a huge base). But as the dictionary is limited to some subset rather than all English words, how deep it is practical to go with brute force depends on the dictionary's size.
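As a minimal sketch (using the hypothetical file names above), two_words_dictionary.txt in the {}_{} form could be generated like this:

from itertools import product

with open('dictionary.txt') as f:
    words = [w.strip() for w in f if w.strip()]

with open('two_words_dictionary.txt', 'w') as out:
    for a, b in product(words, repeat=2):   # every word_word combination
        out.write('{}_{}\n'.format(a, b))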
Let me take another direction on the question and talk about the nature of CRC.
As you know, a Cyclic Redundancy Check is calculated by dividing the message (considered as a polynomial over GF(2)) by a fixed generator polynomial, and it is by nature linear (a concept borrowed from coding theory).
Let me explain what I mean by linear.
Assume we have three messages A, B, C of equal length. If we have
CRC(A) = CRC(B)
then we can say
CRC(A^C) = CRC(B^C)
(meaning that the CRC changes predictably under XOR).
Note that CRC is not a cryptographic hash; its behaviour can be predicted.
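You can convince yourself of this with Python's binascii. The XOR relation holds between messages of equal length, because the initial value and final XOR then cancel out; a quick sketch:

import binascii

def xor_bytes(a, b):
    # XOR two equal-length byte strings
    return bytes(x ^ y for x, y in zip(a, b))

A, B, C = b"weight", b"wright", b"velcro"   # any three messages of the same length

lhs = binascii.crc32(xor_bytes(xor_bytes(A, B), C))
rhs = binascii.crc32(A) ^ binascii.crc32(B) ^ binascii.crc32(C)
print(lhs == rhs)   # True: XOR relations between inputs carry over to the CRCs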
So you don't necessarily need a complicated tool like Hashcat, even if your search space is too big for plain brute force.
So theoretically you can describe the whole solution space of CRC(x) = expectedCRC by setting
x = b0 + nullspace,
where b0 is any string that satisfies CRC(b0) = expectedCRC.
(Another note: these implementations usually have an initial value and a final XOR built in, which is why CRC(0) != 0.)
Then you can re-parameterise the nullspace to be localised, and knowing that your space contains only ASCII characters in constrained sequences, you can search it far more easily.
Finally, knowing that the CRC space holds only 2^32 values while your possible inputs number around 10^12, I would say there are many texts that map onto the same CRC, so expect collisions.

Finding the end of a contiguous substring of a string without iteration or RegEx

I'm trying to write an iterative LL(k) parser, and I've gotten strings down pretty well, because they have a start and end token, and so you can just "".join(tokenlist[string_start:string_end]).
Numbers, however, do not, and only consist of .0123456789. They can occur at any given point in a program, have any arbitrary length and are delimited purely by non-numerals.
Some examples, because that definition is pretty vague:
56 123.45/! is 56 and 123.45 followed by two other tokens
565.5345.345 % is 565.5345, 0.345 and two other tokens (incl. whitespace)
The problem I'm trying to solve is how the parser should figure out where a numeric literal ends. (Note that this is a context-free, self-modifying interpretive grammar thus there is no separate lexical analysis to be done.)
I could and have solved this with iteration:
def _next_notinst(self, atindex, subs=DIGITS):
    """return the next index of a char not in subs"""
    for i, e in enumerate(self.toklist[atindex:]):
        if e not in subs:
            return i - len(self.toklist)
        else:
            break
    return self.idx.v
(I don't think I need to clarify the variables, since it's an example and extremely straightforward.)
Great! That works, but there are at least two issues:
It's O(n) for a number with digit-length n. Not ideal.*
The parser class of which this method is a member is already using a while True: to cycle over arbitrary parts of the string, and I would prefer not having remotely nested loops when I don't need to.
From the previous bullet: since the parser uses arbitrary k lookahead and skipahead, parsing each individual token is absolutely not what I want.
I don't want to use RegEx, mostly because I don't know it, and using it for this right now would make my code incomprehensible to me, its creator.
There must be a simple, < O(n) solution to this, that simply collects the contiguous numerals in a string given a starting point, up until a non-numeral.
*Yes, I'm fully aware the parser itself is O(n), but we don't also need the number catenator to be > O(n). If you don't believe me, the string catenator is O(1) because it simply looks for the next unescaped " in the program and then joins all the chars up to that. Can't I do the same thing for numbers?
My other answer was actually erroneous due to lack of testing.
I decided to suck it up and learn a little bit of RegEx just because it's the only other way to solve this.
^([.\d]+[.\d]+|[.\d]) works for what I want, and matches these:
123.43.453""
.234234!/%
but not, for example:
"1233

Anagram Search Algorithm Comparison in Python [Tutorial Homework]

I am making a simple anagram finder in Python using defaultdict, creating a hash to use as the key and then just going through the dictionary afterwards and printing out anything with more than a single value.
Originally I started by creating the hash as a sorted string:
def createHashFromFile(fileName):
    with open(fileName) as fileObj:
        for line in fileObj:
            line = line.lower()
            aHash = ("").join(sorted(line.strip()))
            aSorter[aHash].append(line.strip())
However, because the sorted() function is O(n^2), I have had the suggestion made to create the hash through prime factorization instead. I created a dictionary that maps each lower-case letter to a prime and then did:
def keyHash(word):
    mulValue = 1
    for letter in word:
        letter = letter.lower()
        mulValue = mulValue * primeDict[letter]
    return mulValue
On 300k words, the string hash runs in 0.75s and the prime hash runs in 1s. I've been reading up on this but I'm unable to determine if I've missed anything with this or why it is running slower.
This is already completed as far as the homework is concerned but I want to understand why or what I am missing here.
There's a whole bunch of factors going on here:
sorted is O(n log n) average case rather than O(n^2). Worst case of sort is almost never relevant in real programs.
multiplying primes together is a clever trick, but while it's O(n) cost in multiplications, multiplying a big number N by a small factor is going to be O(log N) in cost rather than O(1) (since you have to go through the O(log N) digits of the bignum). This means the prime technique is going to be O(n log n) too, because keyHash(s) is going to have O(len(s)) digits.
n is small, so implementation details are going to matter a whole lot more than complexity.
sorted is built-in and written in C. The implementation has been tuned over many years. Your prime multiplication code is written in Python.
You don't say in your question how you performed the timing. It's very easy to get this wrong, by, for example, timing the whole program rather than a micro-benchmark. Given the closeness of the results, I expect you've made such an error, but it's a guess.
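For what it's worth, here is a sketch of an isolated micro-benchmark along those lines; it times only the two key functions, not the file reading or dictionary building. The question doesn't show primeDict, so this assumes the usual construction from the first 26 primes:

import timeit

PRIMES = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41,
          43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97, 101]
primeDict = {chr(ord('a') + i): p for i, p in enumerate(PRIMES)}   # assumed mapping

def keyHash(word):
    mulValue = 1
    for letter in word:
        mulValue *= primeDict[letter.lower()]
    return mulValue

word = "anagrams"
# time only the hashing itself
print(timeit.timeit(lambda: "".join(sorted(word)), number=100000))
print(timeit.timeit(lambda: keyHash(word), number=100000))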

Most efficient way to check if any substrings in list are in another list of strings

I have two lists, one of words, and another of character combinations. What would be the fastest way to only return the combinations that don't match anything in the list?
I've tried to make it as streamlined as possible, but it's still very slow when it uses 3 characters for the combinations (goes up to 290 seconds for 4 characters, not even going to try 5)
Here's some example code, currently I'm converting all the words to a list, and then searching the string for each list value.
#Sample of stuff
allCombinations = ["a","aa","ab","ac","ad"]
allWords = ["testing", "accurate" ]
#Do the calculations
allWordsJoined = ",".join( allWords )
invalidCombinations = set( i for i in allCombinations if i not in allWordsJoined )
print invalidCombinations
#Result: set(['aa', 'ab', 'ad'])
I'm just curious if there's a better way to do this with sets? With a combination of 3 letters, there are 18278 list items to search for, and for 4 letters, that goes up to 475254, so currently my method isn't really fast enough, especially when the word list string is about 1 million characters.
Set.intersection seems like a very useful method if you need the whole string, so surely there must be something similar to search for a substring.
The first thing that comes to mind is that you can optimize lookups by checking the current combination against combinations that are already "invalid". I.e. if ab is invalid, then anything of the form ab? (ab followed by one more character) will be invalid too, so there's no point checking those.
And one more thing: try using
invalidCombinations = set()
for i in allCombinations:
    if i not in allWordsJoined:
        invalidCombinations.add(i)
instead of
invalidCombinations = set(i for i in allCombinations if i not in allWordsJoined)
I'm not sure, but fewer memory allocations can be a small boost on real data.
Checking whether a set contains an item is O(1). You would still have to iterate through your list of combinations to compare against your original set of words, with some exceptions: if your words don't contain "a", they won't contain any other combination that includes "a" either, and you can use a tree-like data structure (a trie) to exploit that.
You shouldn't convert your word list to a string, but rather to a set (of the substrings you care about). You should then get O(N), where N is the number of combinations you check.
Also, I like Python, but it isn't the fastest of languages. If this is the only task you need to do, and it needs to be very fast, and you can't improve the algorithm, you might want to check out other languages. You should be able to very easily prototype something to get an idea of the difference in speed for different languages.
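To make that concrete, here is a minimal sketch of the set-based idea: pre-index every substring of the words up to the combination length once, then the invalid combinations are a single set difference. The pre-indexing costs roughly O(total characters * max combination length), but every lookup afterwards is O(1). The function name is mine:

def invalid_combinations(all_combinations, all_words, max_len):
    seen = set()
    for word in all_words:
        for i in range(len(word)):
            # every substring starting at i, up to max_len characters long
            for j in range(i + 1, min(i + max_len, len(word)) + 1):
                seen.add(word[i:j])
    return set(all_combinations) - seen

print(invalid_combinations(["a", "aa", "ab", "ac", "ad"],
                           ["testing", "accurate"], 2))
# {'aa', 'ab', 'ad'}  (order may vary)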
