Remove duplicates from a string so that it appears once - python

Given a string which contains only lowercase letters, remove duplicate letters so that every letter appear once and only once. I must make sure your result is the smallest in lexicographical order among all possible results.
def removeDuplicates(str):
dict = {}
word = []
for i in xrange(len(str)):
if str[i] not in word:
word.append(str[i])
dict[str[i]] = i
else:
word.remove(str[i])
word.append(str[i])
dict[str[i]] = i
ind = dict.values()
# Second scan
for i in xrange(len(str)):
if str.index(str[i]) in ind:
continue
temp = dict[str[i]]
dict[str[i]] = i
lst = sorted(dict.keys(),key = lambda d:dict[d])
if ''.join(lst) < ''.join(word):
word = lst
else:
dict[str[i]] = temp
return ''.join(word)
I am not getting the desired result
print removeDuplicateLetters("cbacdcbc")
Input:
"cbacdcbc"
Output:
"abcd"
Expected:
"acdb"

Use a set. A set is a data structure similar to a list, but it removes all duplicates. You can instantiate a set by doing set(), or setting a variable to a set by using curly brackets. However, this isn't very good for instantiating empty sets, because then Python will think that it's a dictionary. So to achieve what you're doing, you could make the following function:
def removeDuplicates(string):
return ''.join(sorted(set(string)))

Dorian's answer IS the way to go for any practical application, so my addition is mostly toying around.
If a word is really long, it's more efficient to just search whether each letter in the alphabet is in the string and keep only those that are present. Explicitly,
from string import ascii_lowercase
def removeDuplicates(string):
return ''.join(letter for letter in ascii_lowercase if letter in string)
Code to test timings
import random
import timeit
def compare(string, n):
s1 = "''.join(sorted(set('{}')))".format(string)
print timeit.timeit(s1, number=n)
s2 = "from string import ascii_lowercase; ''.join(letter for letter in ascii_lowercase if letter in '{}')".format(string)
print timeit.timeit(s2, number=n)
Tests:
>>> word = 'cbacdcbc'
>>> compare(word, 1000)
0.00385931823843
0.013727678263
>>> word = ''.join(random.choice(ascii_lowercase) for _ in xrange(100000))
>>> compare(word, 1000)
2.21139290323
0.0071371927042
>>> word = 'a'*100000 + ascii_lowercase
>>> compare(word, 1000)
2.20644530225
1.63490857359
This shows that Dorian's answer should perform equally well or even better for small words, even though the speed isn't noticeable by humans. However, for very large strings, this method is much faster. Even for an edge case, where every letter is the same and the rest of the letters can only be found by transversing the whole string it performs better.
Still, Dorian's answer is more elegant and practical.

This is what makes the test succeed.
def removeDuplicates(my_string):
for char in sorted(set(my_string)):
suffix = my_string[my_string.index(char):]
if set(suffix) == set(my_string):
return char + removeDuplicates(suffix.replace(char, ''))
return ''
print removeDuplicates('cbacdcbc')
acdb

Related

Return Alternating Letters With the Same Length From two Strings

there was a similar question asked on here but they wanted the remaining letters returned if one word was longer. I'm trying to return the same number of characters for both strings.
Here's my code:
def one_each(st, dum):
total = ""
for i in (st, dm):
total += i
return total
x = one_each("bofa", "BOFAAAA")
print(x)
It doesn't work but I'm trying to get this desired output:
>>>bBoOfFaA
How would I go about solving this? Thank you!
str.join with zip is possible, since zip only iterates pairwise up to the shortest iterable. You can combine with itertools.chain to flatten an iterable of tuples:
from itertools import chain
def one_each(st, dum):
return ''.join(chain.from_iterable(zip(st, dum)))
x = one_each("bofa", "BOFAAAA")
print(x)
bBoOfFaA
I'd probably do something like this
s1 = "abc"
s2 = "123"
ret = "".join(a+b for a,b in zip(s1, s2))
print (ret)
Here's a short way of doing it.
def one_each(short, long):
if len(short) > len(long):
short, long = long, short # Swap if the input is in incorrect order
index = 0
new_string = ""
for character in short:
new_string += character + long[index]
index += 1
return new_string
x = one_each("bofa", "BOFAAAA") # returns bBoOfFaA
print(x)
It might show wrong results when you enter x = one_each("abcdefghij", "ABCD") i.e when the small letters are longer than capital letters, but that can be easily fixed if you alter the case of each letter of the output.

How to index list when there is 2 of the same char

I have been trying to make the even letters in a string become upper-cased and the odd letters to become lower-cased with a function, like so:
def myfunc('apple'):
#OUTPUTS: 'ApPlE'
This is what I made:
def myfunc(mystring):
stringList = [letter for letter in mystring]
for letter in stringList[1::2]:
stringList[stringList.index(letter)] = letter.lower()
for letter in stringList[::2]:
stringList[stringList.index(letter)] = letter.upper()
return ''.join(stringList)
I believe that, when I use words like 'apple' where there is two identical letters, the index() function can only manage to give me the index of the first 'p', if my word is apple.
It returns:
'APplE'
How could I fix this?
By iterating over the indices of the string, using the built-in function enumerate, together with the characters of the string (strings are also iterable):
def myfunc(mystring):
out = []
for i, c in enumerate(mystring):
if i % 2 == 0:
out.append(c.upper())
else:
out.append(c.lower())
return "".join(out)
Example output:
>>> myfunc('apple')
'ApPlE'
This is also a lot more efficient, since it only iterates over the string once. Your code iterates many times (each stringList.index call does a linear search for the letter).
If you want to make it a bit harder to read but re-use a bit more of what you already have, you can also use this, but I would not recommend it (as it iterates three times over the string, once to build the list and then twice to replace the characters):
def myfunc(mystring):
stringList = list(mystring)
stringList[::2] = map(str.upper, stringList[::2])
stringList[1::2] = map(str.lower, stringList[1::2])
return "".join(stringList)
The method list.index returns the index of the first occurence, making it unfit for recovering the index of the current element. Instead, you should use enumerate, this will allow you to get the expected result with a single list-comprehension.
def myFunc(s):
return ''.join([c.lower() if i % 2 else c.upper() for i, c in enumerate(s)])
print(myFunc('apple')) # ApPlE

Finding regular expression with at least one repetition of each letter

From any *.fasta DNA sequence (only 'ACTG' characters) I must find all sequences which contain at least one repetition of each letter.
For examle from sequence 'AAGTCCTAG' I should be able to find: 'AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG' and 'CTAG' (iteration on each letter).
I have no clue how to do that in pyhton 2.7. I was trying with regular expressions but it was not searching for every variants.
How can I achive that?
You could find all substrings of length 4+, and then down select from those to find only the shortest possible combinations that contain one of each letter:
s = 'AAGTCCTAG'
def get_shortest(s):
l, b = len(s), set('ATCG')
options = [s[i:j+1] for i in range(l) for j in range(i,l) if (j+1)-i > 3]
return [i for i in options if len(set(i) & b) == 4 and (set(i) != set(i[:-1]))]
print(get_shortest(s))
Output:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
This is another way you can do it. Maybe not as fast and nice as chrisz answere. But maybe a little simpler to read and understand for beginners.
DNA='AAGTCCTAG'
toSave=[]
for i in range(len(DNA)):
letters=['A','G','T','C']
j=i
seq=[]
while len(letters)>0 and j<(len(DNA)):
seq.append(DNA[j])
try:
letters.remove(DNA[j])
except:
pass
j+=1
if len(letters)==0:
toSave.append(seq)
print(toSave)
Since the substring you are looking for may be of about any length, a LIFO queue seems to work. Append each letter at a time, check if there are at least one of each letters. If found return it. Then remove letters at the front and keep checking until no longer valid.
def find_agtc_seq(seq_in):
chars = 'AGTC'
cur_str = []
for ch in seq_in:
cur_str.append(ch)
while all(map(cur_str.count,chars)):
yield("".join(cur_str))
cur_str.pop(0)
seq = 'AAGTCCTAG'
for substr in find_agtc_seq(seq):
print(substr)
That seems to result in the substrings you are looking for:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
I really wanted to create a short answer for this, so this is what I came up with!
See code in use here
s = 'AAGTCCTAG'
d = 'ACGT'
c = len(d)
while c <= len(s):
x,c = s[:c],c+1
if all(l in x for l in d):
print(x)
s,c = s[1:],len(d)
It works as follows:
c is set to the length of the string of characters we are ensuring exist in the string (d = ACGT)
The while loop iterates over each possible substring of s such that c is smaller than the length of s.
This works by increasing c by 1 upon each iteration of the while loop.
If every character in our string d (ACGT) exist in the substring, we print the result, reset c to its default value and slice the string by 1 character from the start.
The loop continues until the string s is shorter than d
Result:
AAGTC
AGTC
GTCCTA
TCCTAG
CCTAG
CTAG
To get the output in a list instead (see code in use here):
s = 'AAGTCCTAG'
d = 'ACGT'
c,r = len(d),[]
while c <= len(s):
x,c = s[:c],c+1
if all(l in x for l in d):
r.append(x)
s,c = s[1:],len(d)
print(r)
Result:
['AAGTC', 'AGTC', 'GTCCTA', 'TCCTAG', 'CCTAG', 'CTAG']
If you can break the sequence into a list, e.g. of 5-letter sequences, you could then use this function to find repeated sequences.
from itertools import groupby
import numpy as np
def find_repeats(input_list, n_repeats):
flagged_items = []
for item in input_list:
# Create itertools.groupby object
groups = groupby(str(item))
# Create list of tuples: (digit, number of repeats)
result = [(label, sum(1 for _ in group)) for label, group in groups]
# Extract just number of repeats
char_lens = np.array([x[1] for x in result])
# Append to flagged items
if any(char_lens >= n_repeats):
flagged_items.append(item)
# Return flagged items
return flagged_items
#--------------------------------------
test_list = ['aatcg', 'ctagg', 'catcg']
find_repeats(test_list, n_repeats=2) # Returns ['aatcg', 'ctagg']

How to split string everywhere a letter appears?

I have a string containing letters and numbers like this -
12345A6789B12345C
How can I get a list that looks like this
[12345A, 6789B, 12345C]
>>> my_string = '12345A6789B12345C'
>>> import re
>>> re.findall('\d*\w', my_string)
['12345A', '6789B', '12345C']
For the sake of completeness, non-regex solution:
data = "12345A6789B12345C"
result = [""]
for char in data:
result[-1] += char
if char.isalpha():
result.append("")
if not result[-1]:
result.pop()
print(result)
# ['12345A', '6789B', '12345C']
Should be faster for smaller strings, but if you're working with huge data go with regex as once compiled and warmed up, the search separation happens on the 'fast' C side.
You could build this with a generator, too. The approach below keeps track of start and end indices of each slice, yielding a generator of strings. You'll have to cast it to list to use it as one, though (splitonalpha(some_string)[-1] will fail, since generators aren't indexable)
def splitonalpha(s):
start = 0
for end, ch in enumerate(s, start=1):
if ch.isalpha:
yield s[start:end]
start = end
list(splitonalpha("12345A6789B12345C"))
# ['12345A', '6789B', '12345C']

Removing a character in a string one at a time

Basically I want to remove a character in a string one at a time if it occurs multiple times .
For eg :- if I have a word abaccea and character 'a' then the output of the function should be baccea , abacce , abccea.
I read that I can make maketrans for a and empty string but it replaces every a in the string.
Is there an efficient way to do this besides noting all the positions in a list and then replacing and generating the words ??
Here is a quick way of doing it:
In [6]: s = "abaccea"
In [9]: [s[:key] + s[key+1:] for key,val in enumerate(s) if val == "a"]
Out[10]: ['baccea', 'abccea', 'abacce']
There is the benefit of being able to turn this into a generator by simpling replacing square brackets with round ones.
You could try the following script. It provides a simple function to do what you ask. The use of list comprehensions [x for x in y if something(x)] is well worth learning.
#!/usr/bin/python
word = "abaccea"
letter = "a"
def single_remove(word, letter):
"""Remove character c from text t one at a time
"""
indexes = [c for c in xrange(len(word)) if word[c] == letter]
return [word[:i] + word[i + 1:] for i in indexes]
print single_remove(word, letter)
returns ['baccea', 'abccea', 'abacce']
Cheers
I'd say that your approach sounds good - it is a reasonably efficient way to do it and it will be clear to the reader what you are doing.
However a slightly less elegant but possibly faster alternative is to use the start parameter of the find function.
i = 0
while True:
j = word.find('a', i)
if j == -1:
break
print word[:j] + word[j+1:]
i = j + 1
The find function is likely to be highly optimized in C, so this may give you a performance improvement compared to iterating over the characters in the string yourself in Python. Whether you want to do this though depends on whether you are looking for efficiency or elegance. I'd recommend going for the simple and clear approach first, and only optimizing it if performance profiling shows that efficiency is an important issue.
Here are some performance measurements showing that the code using find can run faster:
>>> method1='[s[:key] + s[key+1:] for key,val in enumerate(s) if val == "a"]'
>>> method2='''
result=[]
i = 0
while True:
j = s.find('a', i)
if j == -1:
break
result.append(s[:j] + s[j+1:])
i = j + 1
'''
>>> timeit.timeit(method1, init, number=100000)
2.5391986271997666
>>> timeit.timeit(method2, init, number=100000)
1.1471052885212885
how about this ?
>>> def replace_a(word):
... word = word[1:8]
... return word
...
>>> replace_a("abaccea")
'baccea'
>>>

Categories