What is an efficient way to check that a string s in Python consists of just one character, say 'A'? Something like all_equal(s, 'A') which would behave like this:
all_equal("AAAAA", "A") = True
all_equal("AAAAAAAAAAA", "A") = True
all_equal("AAAAAfAAAAA", "A") = False
Two seemingly inefficient ways would be to: first convert the string to a list and check each element, or second to use a regular expression. Are there more efficient ways or are these the best one can do in Python? Thanks.
This is by far the fastest, several times faster than even count(). Just time it with mgilson's excellent timing suite:
s == len(s) * s[0]
Here all the checking is done inside the Python C code which just:
allocates len(s) characters;
fills the space with the first character;
compares two strings.
The longer the string is, the greater the time bonus. However, as mgilson writes, it creates a copy of the string, so if your string is many millions of symbols long, it may become a problem.
As we can see from the timing results, the fastest ways to solve the task generally do not execute any Python code for each symbol. However, the set() solution also does all its work inside the C code of the Python library, but it is still slow, probably because it handles the string through the Python object interface.
UPD: Concerning the empty string case. What to do with it strongly depends on the task. If the task is "check if all the symbols in a string are the same", s == len(s) * s[0] is a valid answer (no symbols mean an error, and exception is ok). If the task is "check if there is exactly one unique symbol", empty string should give us False, and the answer is s and s == len(s) * s[0], or bool(s) and s == len(s) * s[0] if you prefer receiving boolean values. Finally, if we understand the task as "check if there are no different symbols", the result for empty string is True, and the answer is not s or s == len(s) * s[0].
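The three readings of the empty-string case can be folded into one helper. This is only a sketch written for Python 3; the name all_same and the empty parameter are my own, hypothetical choices and do not come from any library:

```python
def all_same(s, empty="error"):
    """Return True if every character of s equals the first one.

    `empty` is a hypothetical switch selecting how "" is treated:
    "error" raises, "false" returns False, "true" returns True.
    """
    if not s:
        if empty == "error":
            raise ValueError("empty string has no first character")
        return empty == "true"
    # One C-level allocation, fill and compare, as described above.
    return s == len(s) * s[0]


print(all_same("AAAAA"))         # True
print(all_same("AAAAAfAAAAA"))   # False
print(all_same("", "true"))      # True
```

Which default is right depends on the task, exactly as discussed above.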
>>> s = 'AAAAAAAAAAAAAAAAAAA'
>>> s.count(s[0]) == len(s)
True
This doesn't short circuit. A version which does short-circuit would be:
>>> all(x == s[0] for x in s)
True
However, I have a feeling that, due to the optimized C implementation, the non-short-circuiting version will probably perform better on some strings (depending on size, etc.).
Here's a simple timeit script to test some of the other options posted:
import timeit
import re

def test_regex(s,regex=re.compile(r'^(.)\1*$')):
    return bool(regex.match(s))

def test_all(s):
    return all(x == s[0] for x in s)

def test_count(s):
    return s.count(s[0]) == len(s)

def test_set(s):
    return len(set(s)) == 1

def test_replace(s):
    return not s.replace(s[0],'')

def test_translate(s):
    return not s.translate(None,s[0])

def test_strmul(s):
    return s == s[0]*len(s)

tests = ('test_all','test_count','test_set','test_replace','test_translate','test_strmul','test_regex')

print "WITH ALL EQUAL"
for test in tests:
    print test, timeit.timeit('%s(s)'%test,'from __main__ import %s; s="AAAAAAAAAAAAAAAAA"'%test)
    if globals()[test]("AAAAAAAAAAAAAAAAA") != True:
        print globals()[test]("AAAAAAAAAAAAAAAAA")
        raise AssertionError

print
print "WITH FIRST NON-EQUAL"
for test in tests:
    print test, timeit.timeit('%s(s)'%test,'from __main__ import %s; s="FAAAAAAAAAAAAAAAA"'%test)
    if globals()[test]("FAAAAAAAAAAAAAAAA") != False:
        print globals()[test]("FAAAAAAAAAAAAAAAA")
        raise AssertionError
On my machine (OS-X 10.5.8, core2duo, python2.7.3) with these contrived (short) strings, str.count smokes set and all, and beats str.replace by a little, but is edged out by str.translate and strmul is currently in the lead by a good margin:
WITH ALL EQUAL
test_all 5.83863711357
test_count 0.947771072388
test_set 2.01028490067
test_replace 1.24682998657
test_translate 0.941282987595
test_strmul 0.629556179047
test_regex 2.52913498878
WITH FIRST NON-EQUAL
test_all 2.41147494316
test_count 0.942595005035
test_set 2.00480484962
test_replace 0.960338115692
test_translate 0.924381017685
test_strmul 0.622269153595
test_regex 1.36632800102
The timings could be slightly (or even significantly?) different between different systems and with different strings, so that would be worth looking into with an actual string you're planning on passing.
Ultimately, if you hit the best case for all often enough (a mismatch early in the string), and your strings are long enough, you might want to consider that one. It's a better algorithm ... I would avoid the set solution, though, as I don't see any case where it could possibly beat the count solution.
If memory could be an issue, you'll need to avoid str.translate, str.replace and strmul as those create a second string, but this isn't usually a concern these days.
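The script above is Python 2: s.translate(None, chars) and the print statements do not run on Python 3, where str.translate takes a single mapping argument. A rough Python 3 sketch of three of the candidates follows; the check_* names are mine, and I make no claim that the relative timings carry over:

```python
import timeit


def check_translate(s):
    # Python 3 form: mapping a code point to None deletes that character.
    return not s.translate({ord(s[0]): None})


def check_strmul(s):
    return s == s[0] * len(s)


def check_count(s):
    # Does not build a second string, so it is the memory-friendly option.
    return s.count(s[0]) == len(s)


for func in (check_translate, check_strmul, check_count):
    for s in ("A" * 17, "F" + "A" * 16):
        elapsed = timeit.timeit(lambda: func(s), number=10000)
        print(func.__name__, s[0], elapsed)
```

As before, time it with the strings you actually expect rather than trusting these contrived ones.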
You could convert to a set and check there is only one member:
len(set("AAAAAAAA")) == 1
Try using the built-in function all:
all(c == 'A' for c in s)
If you need to check that all the characters in the string are the same and equal to a given character, remove the duplicates and check that the result equals that single character.
>>> set("AAAAA") == set("A")
True
If you only want to know whether all the characters are the same, whatever that character is, just check the length of the set:
>>> len(set("AAAAA")) == 1
True
Adding another solution to this problem
>>> not "AAAAAA".translate(None,"A")
True
Interesting answers so far. Here's another:
flag = True
for c in 'AAAAAAAfAAAA':
    if not c == 'A':
        flag = False
        break
The only advantage I can think of to mine is that it doesn't need to traverse the entire string if it finds an inconsistent character.
not len("AAAAAAAAA".replace('A', ''))
Related
I am trying to write a function to check whether a string is palindrome or not, but every string is showing as palindrome
def is_palindrome(input_string):
    x=0
    reverse_string = ""
    while x<len(input_string):
        reverse_string+=input_string[x]
        x=x+1
    if input_string == reverse_string:
        return True
    else:
        return False

print(is_palindrome("abc")) # Should be False but it return True
In your code, the variable "reverse_string" will always be equal to "input_string", since you are just appending the characters in the same order with the += operator.
A simple way to reverse a string in Python is to use slicing, like this:
def is_palindrome(input_string):
    if input_string == input_string[::-1]:
        return True
    return False
input_string[::-1] means "take the whole string with a step of -1", i.e. walk it from the last character back to the first.
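A quick illustration of what the slice does, as a minimal sketch with made-up example strings:

```python
word = "abc"
reversed_word = word[::-1]   # omitted start/stop, step of -1

print(reversed_word)              # cba
print("level" == "level"[::-1])   # True, reads the same both ways
print("abc" == "abc"[::-1])       # False
```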
Your problem is in the reversal of the string. (your x is going from 0 to len(input_string)-1 but it should go the other way)
That's why it's important to break code into functions that do one and only one thing (at least in the beginning).
In this case it is overkill, but it will help you when your code grows more complex.
Your function can then be simplified to:
def is_palindrome(input_string):
    return input_string == reverse_string(input_string)
If you look at it, it is self-explanatory: is the input string equal to its reverse?
Now we need to implement the function reverse_string.
The advantage of having a function that just reverses a string is that we can do a lot of tests on it to check just this particular function
In your case, you can use negative indexes, or you can start with the index set to len(input_string)-1 and go towards 0.
But it's also a good moment to learn about string slicing and how to do things in a pythonic way, so the reverse function can be written as:
def reverse_string(input_string):
    return input_string[::-1]
Feel free to put in your own implementation of reverse_string if you are not yet confident with string slicing, but with this function you have separated two different things: reversing a string and checking whether a string is a palindrome. You can even reuse that reverse_string function later on.
Now we can test it with many cases until we are confident that it works as expected.
I'd recommend taking a look at unit tests. It might seem too much for such an easy problem, but it will help you a lot in the future.
Just test what happens if you pass a palindrome, a non-palindrome, an empty string, a number, a None...
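Those cases could look roughly like this with the standard unittest module. This is a sketch for Python 3 that assumes the slicing-based implementation; the class and method names are my own:

```python
import unittest


def is_palindrome(input_string):
    return input_string == input_string[::-1]


class PalindromeTests(unittest.TestCase):
    def test_palindrome(self):
        self.assertTrue(is_palindrome("level"))

    def test_non_palindrome(self):
        self.assertFalse(is_palindrome("abc"))

    def test_empty_string(self):
        # With this implementation the empty string counts as a palindrome.
        self.assertTrue(is_palindrome(""))

    def test_non_strings(self):
        # Numbers and None cannot be sliced, so a TypeError is raised.
        for value in (121, None):
            with self.assertRaises(TypeError):
                is_palindrome(value)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(PalindromeTests)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Whether a number or None should raise or simply return False is a design decision; the tests just pin down whatever you decide.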
I'm trying to build logic in programming. I need to write a Python function that takes a string as input and checks whether any character appears more than once. The function should return True if there are no repetitions and False otherwise. I have searched online and found several related examples. I wrote the code below and it seemed fine initially, but then I realized my mistake, and now I don't see how I should go about it. Please guide me.
def repfree(S):
    for char in S:
        if S.count(char) > 1:
            return True
    return False
Here, you can create a character list to keep track of the characters that have already occurred in S.
Have a look at the code below; hope it helps:
def repfree(S):
    freq = []
    for char in S:
        # if the character is already in list that means S contains repeated char
        if char in freq:
            return False
        else:
            freq.append(char)
    return True
How about using a set?
def repfree(s):
    char_set = set()
    for c in s:
        # sets have add(), not append()
        char_set.add(c)
    return len(char_set) == len(s)
You can try the following code.
def rep_free(text):
    # True when no character repeats: the set is as long as the string
    return len(text) == len(set(text))
A set object is an unordered collection of distinct hashable objects. Common uses include membership testing, removing duplicates from a sequence, and computing mathematical operations such as intersection, union, difference, and symmetric difference. (For other containers see the built-in dict, list, and tuple classes, and the collections module.)
From python doc
This should be done using Python's set data structure; a set always contains only unique elements.
string="country"
def repetion(string):
    if len(set(string)) == len(string):
        print("string is having unique chars")
    else:
        print("string has repeated chars")
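The list-tracking and set ideas above can be combined: a set gives fast membership tests, and returning at the first repeat avoids scanning the rest of the string. A minimal sketch, keeping the asker's function name:

```python
def repfree(s):
    """Return True if no character occurs more than once in s."""
    seen = set()
    for ch in s:
        if ch in seen:
            return False   # first repeated character found, stop early
        seen.add(ch)
    return True


print(repfree("country"))   # True
print(repfree("hello"))     # False
```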
Please keep in mind that I am still fairly new to Python. I have this question which I have a fair bit of trouble understanding. I have made an attempt at this problem:
def Sample(A):
    for i in range(len(A)):
        if A == 1:
            print A
        elif A == (-1):
            print A
Question:
Write a function where A is a list of strings, as of such print all the strings in A that start with '-1' or '1'
In your if and elif you are testing A, i.e. whether the whole list is equal to the value, which will never be True. Instead, test each item in A. You can either stick with your index:
for i in range(len(A)):
    if A[i] == ...
or, better, iterate over A directly:
for item in A:
    if item == ...
Next, to test whether a string starts with a character, use str.startswith:
for item in A:
    if item.startswith("1"):
        ...
Note that this uses the string "1", rather than the integer 1.
You're comparing A itself to 1 or -1, which can never match. You should be checking whether each item in A starts with either of those:
if item.startswith("1"):
Edit:
I completely edited my answer since I had misunderstood the question the first time.
You need to test for two cases: does a string in A start with '1', or does it start with '-1'?
Python offers a number of ways to do this. The first is the str.startswith() method, which checks whether the string starts with the prefix you specify. It also accepts a tuple of prefixes:
def Sample(A):
    for each_string in A:
        if each_string.startswith(('1','-1')):
            print each_string
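For Python 3, the same idea could be sketched like this. I return a list instead of printing so the result is easy to check; the function name and the sample data are made up:

```python
def sample(strings):
    """Return the strings that start with '1' or '-1'."""
    # str.startswith accepts a tuple and matches if any prefix fits.
    return [s for s in strings if s.startswith(("1", "-1"))]


for item in sample(["1a", "-1b", "2c", "-2", "x1", ""]):
    print(item)
```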
I'm trying to make a glob-like expansion of a set of DNA strings that have multiple possible bases.
The base of my DNA strings contains the letters A, C, G, and T. However, I can have special characters like M which could be an A or a C.
For example, say I have the string:
ATMM
I would like to take this string as input and output the four possible matching strings:
ATAA
ATAC
ATCA
ATCC
Rather than brute force a solution, I feel like there must be some elegant Python/Perl/Regular Expression trick to do this.
Thank you for any advice.
Edit, thanks cortex for the product operator. This is my solution:
Still a Python newbie, so I bet there's a better way to handle each dictionary key than another for loop. Any suggestions would be great.
import sys
from itertools import product

baseDict = dict(M=['A','C'],R=['A','G'],W=['A','T'],S=['C','G'],
                Y=['C','T'],K=['G','T'],V=['A','C','G'],
                H=['A','C','T'],D=['A','G','T'],B=['C','G','T'])

def glob(str):
    strings = [str]
    ## this loop visits every possible base in the dictionary
    ## probably a cleaner way to do it
    for base in baseDict:
        oldstrings = strings
        strings = []
        for string in oldstrings:
            strings += map("".join,product(*[baseDict[base] if x == base
                                 else [x] for x in string]))
    return strings

for line in sys.stdin.readlines():
    line = line.rstrip('\n')
    permutations = glob(line)
    for x in permutations:
        print x
Agree with other posters that it seems like a strange thing to want to do. Of course, if you really want to, there is (as always) an elegant way to do it in Python (2.6+):
from itertools import product
map("".join, product(*[['A', 'C'] if x == "M" else [x] for x in "GMTTMCA"]))
Full solution with input handling:
import sys
from itertools import product

base_globs = {"M":['A','C'], "R":['A','G'], "W":['A','T'],
              "S":['C','G'], "Y":['C','T'], "K":['G','T'],
              "V":['A','C','G'], "H":['A','C','T'],
              "D":['A','G','T'], "B":['C','G','T'],
              }

def base_glob(glob_sequence):
    production_sequence = [base_globs.get(base, [base]) for base in glob_sequence]
    return map("".join, product(*production_sequence))

for line in sys.stdin.readlines():
    productions = base_glob(line.strip())
    print "\n".join(productions)
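On Python 3, map returns a lazy iterator and print is a function, so a port might look like the sketch below. The constant name BASE_GLOBS is mine, and I use strings rather than lists for the options purely for brevity:

```python
from itertools import product

# Ambiguity codes from the question, each mapped to the bases it may stand for.
BASE_GLOBS = {
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
}


def base_glob(glob_sequence):
    """Expand every ambiguity code into all concrete sequences."""
    # A plain base stands for itself: .get falls back to the base.
    options = [BASE_GLOBS.get(base, base) for base in glob_sequence]
    return ["".join(combo) for combo in product(*options)]


print("\n".join(base_glob("ATMM")))
```

For ATMM this prints the four strings from the question (ATAA, ATAC, ATCA, ATCC). For long sequences you may prefer to return a generator instead of a list, so that not every expansion is held in memory at once.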
You could probably do something like this in Python using yield (a generator):
def glob(str):
    if str=='':
        yield ''
        return
    if str[0]!='M':
        for tail in glob(str[1:]):
            yield str[0] + tail
    else:
        # per the question, M stands for A or C
        for c in ['A','C']:
            for tail in glob(str[1:]):
                yield c + tail
    return
EDIT: As correctly pointed out I was making a few mistakes. Here is a version which I tried out and works.
This isn't really an "expansion" problem and it's almost certainly not doable with any sensible regular expression.
I believe what you're looking for is "how to generate permutations".
You could for example do this recursively. Pseudo-code:
printSequences(sequence s)
    switch "first special character in sequence"
        case ...
        case M:
            s1 = s, but first M replaced with A
            printSequences(s1)
            s2 = s, but first M replaced with C
            printSequences(s2)
        case none:
            print s;
Regexps match strings, they're not intended to be turned into every string they might match.
Also, you're looking at a lot of strings being output from this - for instance:
MMMMMMMMMMMMMMMM (16 M's)
produces 65,536 16 character strings - and I'm guessing that DNA sequences are usually longer than that.
Arguably any solution to this is pretty much 'brute force' from a computer science perspective, because your algorithm is O(2^n) on the original string length. There's actually quite a lot of work to be done.
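To see how quickly this grows before generating anything, the number of output strings can be computed as the product of the number of options at each position. A small sketch; the function name is mine and the table holds only the M code from the question:

```python
def count_expansions(sequence, base_globs):
    """Number of concrete strings a sequence with ambiguity codes expands to."""
    total = 1
    for base in sequence:
        # A plain base contributes a factor of 1, a code its option count.
        total *= len(base_globs.get(base, base))
    return total


codes = {"M": "AC"}
print(count_expansions("ATMM", codes))     # 4
print(count_expansions("M" * 16, codes))   # 65536
```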
Why do you want to produce all the combinations? What are you going to do with them? (If you're thinking to produce every string possibility and then look for it in a large DNA sequence, then there are much better ways of doing that.)
I'm trying to find the letter of the alphabet that is missing from a list, with the fewest lines of code.
If the list is already sorted (using list.sort()), what is the fastest way, or the way with the fewest lines of code, to find the missing letter?
Assume I know there is only one missing letter.
(This is not an interview question. I actually need to do this in my script, where I want to put the least amount of work into this step, since it will be repeated over and over an indeterminate number of times.)
Some questions:
Are all the letters upper or lower case? (a/A)
Is this the only alphabet you'll want to check?
Why are you doing this so often?
Least lines of code:
# do this once, outside the loop
import string
alphabet=set(string.ascii_lowercase)
# inside the loop, just 1 line:
missingletter=(alphabet-set(yourlist)).pop()
The advantage of the above is that you can do it without having to sort the list first. If, however, the list is always sorted, you can use bisection to get there faster. On a simple 26-letter alphabet though, is there much point?
Bisection (done in ~4 lookups):
frompos, topos = 0, len(str)
for i in range(1,100): #never say forever with bisection...
    trypos = (frompos+topos+1)/2
    print "try:",frompos,trypos,topos
    if alphabet[trypos] != str[trypos]:
        topos = trypos
    else:
        frompos = trypos
    if topos-frompos==1:
        if alphabet[topos] != str[topos]:
            print alphabet[frompos]
        else:
            print alphabet[topos]
        break
This code requires fewer lookups, so it scales far better, at O(log n), but it may still be slower when run by the Python interpreter, because it goes through Python-level ifs instead of set operations written in C.
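For reference, here is how I would sketch the bisection as a self-contained Python 3 function. It assumes the input is sorted and that exactly one letter of the alphabet is missing; the function name is my own:

```python
import string


def missing_letter_bisect(s, alphabet=string.ascii_lowercase):
    """Return the single letter of `alphabet` missing from the sorted string s."""
    lo, hi = 0, len(s)   # the first mismatching index lies in [lo, hi]
    while lo < hi:
        mid = (lo + hi) // 2
        if s[mid] == alphabet[mid]:
            lo = mid + 1   # everything up to mid lines up, look to the right
        else:
            hi = mid       # the gap is at mid or to its left
    # lo == len(s) happens when the last letter is the missing one.
    return alphabet[lo]


print(missing_letter_bisect("abcdefghijklmnoprstuvwxyz"))   # q
```

Before the gap, s and the alphabet agree position by position; after it, every position is shifted by one. The loop searches for the first index where they disagree.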
(Thanks to J.F.Sebastian and Kylotan for their comments)
Using a list comprehension:
>>> import string
>>> sourcelist = 'abcdefghijklmnopqrstuvwx'
>>> [letter for letter in string.ascii_lowercase if letter not in sourcelist]
['y', 'z']
>>>
The string module has some predefined constants that are useful.
>>> string.ascii_lowercase
'abcdefghijklmnopqrstuvwxyz'
>>> string.letters
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
>>> string.hexdigits
'0123456789abcdefABCDEF'
>>> string.octdigits
'01234567'
>>> string.digits
'0123456789'
>>> string.printable
'0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~ \t\n\r\x0b\x0c'
>>>
In the "too clever for its own good" category, and assuming there is exactly one missing letter in a lowercase alphabet:
print chr(2847 - sum(map(ord, theString)))
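The magic number 2847 is just the sum of the character codes of 'a' through 'z', so subtracting the sum of the letters that are present leaves the code of the missing one. A Python 3 sketch that derives the constant instead of hard-coding it (the names are mine):

```python
import string

# Sum of ord('a') .. ord('z'); this is where 2847 comes from.
ALPHABET_TOTAL = sum(map(ord, string.ascii_lowercase))


def missing_letter_sum(s):
    """Assumes s holds each lowercase letter exactly once, except one."""
    return chr(ALPHABET_TOTAL - sum(map(ord, s)))


print(ALPHABET_TOTAL)                                     # 2847
print(missing_letter_sum("abcdefghijklmnoprstuvwxyz"))   # q
```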
[Edit]
I've run some timings on the various solutions to see which is faster.
Mine turned out to be fairly slow in practice (slightly faster if I use itertools.imap instead).
Surprisingly, the listcomp solution by monkut turned out to be fastest - I'd have expected the set solutions to do better, as this must scan the list each time to find the missing letter.
I tried first converting the test list to a set in advance of membership checking, expecting this to speed it up but in fact it made it slower. It looks like the constant factor delay in creating the set dwarfs the cost of using an O(n**2) algorithm for such a short string.
That suggested that an even more basic approach, taking advantage of early exiting, could perform even better. The below is what I think currently performs best:
def missing_letter_basic(s):
    for letter in string.ascii_lowercase:
        if letter not in s: return letter
    raise Exception("No missing letter")
The bisection method is probably best when working with larger strings however. It is only just edged out by the listcomp here, and has much better asymptotic complexity, so for strings larger than an alphabet, it will clearly win.
[Edit2]
Actually, cheating a bit, I can do even better than that by abusing the fact that there are only 26 strings to check. Behold the ultimate O(1) missing letter finder!
find_missing_letter = dict((string.ascii_lowercase[:i]+string.ascii_lowercase[i+1:],
string.ascii_lowercase[i]) for i in range(26)).get
>>> find_missing_letter('abcdefghijklmnoprstuvwxyz')
'q'
Here are my timings (500000 runs, tested with letters missing near the start, middle and end of the string: b, m and y):
"b" "m" "y"
bisect : 2.762 2.872 2.922 (Phil H)
find_gap : 3.388 4.533 5.642 (unwind)
listcomp : 2.832 2.858 2.822 (monkut)
listcomp_set : 4.770 4.746 4.700 As above, with sourcelist=set(sourcelist) first
set_difference : 2.924 2.945 2.880 (Phil H)
sum : 3.815 3.806 3.868
sum_imap : 3.284 3.280 3.260
basic : 0.544 1.379 2.359
dict_lookup : 0.135 0.133 0.134
Here's one way of doing it, assuming your "alphabet" consists of integers, and that the list has at least two items:
for i in xrange(1, len(a)):
    if a[i] != a[i - 1] + 1:
        print a[i - 1] + 1, "is missing"
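In Python 3 (no xrange, print is a function) the same scan could be sketched with zip over neighbouring pairs. Like the loop above, it reports only the first missing value of each gap; the function name and sample data are made up:

```python
def first_missing_per_gap(a):
    """For a sorted list of integers, return the value expected after each gap."""
    # Pair every element with its successor and keep the broken steps.
    return [prev + 1 for prev, cur in zip(a, a[1:]) if cur != prev + 1]


for value in first_missing_per_gap([1, 2, 4, 5, 7]):
    print(value, "is missing")
```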
With sorted lists, a binary search is usually the fastest algorithm. Could you please provide an example list and an example "missing alphabet"?
If you're talking about the alphabet as letters:
letterSet = set()
for word in wordList:
    letterSet.update(set(word.lower()))

import string
alphabet = set(string.lowercase)
missingLetters = alphabet.difference(letterSet)
class MissingFinder(object):
    "A simplified missing items locator"

    def __init__(self, alphabet):
        "Store a set from our alphabet"
        self.alphabet= set(alphabet)

    def missing(self, sequence):
        "Return set of missing letters; sequence not necessarily set"
        return self.alphabet.difference(sequence)
>>> import string
>>> finder= MissingFinder(string.ascii_lowercase)
>>> finder.missing(string.ascii_lowercase[:5] + string.ascii_lowercase[6:])
set(['f'])
>>> # rinse, repeat calling finder.missing
I'm sure the class and instance names could be improved :)