Find "one letter that appears twice" in a string - python

I'm trying to detect a letter that appears twice in a row in a string using RegEx (or maybe there's a better way?). For example, my string is:
ugknbfddgicrmopn
The output would be:
dd
However, I've tried something like:
re.findall('[a-z]{2}', 'ugknbfddgicrmopn')
but in this case, it returns:
['ug', 'kn', 'bf', 'dd', 'gi', 'cr', 'mo', 'pn'] # the expected output is `['dd']`
I also have a way to get the expected output:
>>> l = []
>>> tmp = None
>>> for i in 'ugknbfddgicrmopn':
...     if tmp != i:
...         tmp = i
...         continue
...     l.append(i*2)
...
>>> l
['dd']
But that's too complex...
If it's 'abbbcppq', then only catch:
abbbcppq
 ^^  ^^
So the output is:
['bb', 'pp']
Then, if it's 'abbbbcppq', catch bb twice:
abbbbcppq
 ^^^^ ^^
So the output is:
['bb', 'bb', 'pp']

You need to use a regex with a capturing group and a backreference, and write the pattern as a raw string.
>>> re.search(r'([a-z])\1', 'ugknbfddgicrmopn').group()
'dd'
>>> [i+i for i in re.findall(r'([a-z])\1', 'abbbbcppq')]
['bb', 'bb', 'pp']
or
>>> [i[0] for i in re.findall(r'(([a-z])\2)', 'abbbbcppq')]
['bb', 'bb', 'pp']
Note that re.findall here returns a list of tuples, with the text matched by the first group as the first element and the text matched by the second group as the second element. In our case the first group is enough, hence i[0].
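To make the tuple shape concrete, here is a small sketch (the variable names are illustrative, not from the answer) that prints what re.findall returns for the two-group pattern before the first element is picked out:

```python
import re

# Group 1 is the whole doubled pair, group 2 is the single letter it repeats.
matches = re.findall(r'(([a-z])\2)', 'abbbbcppq')
print(matches)   # [('bb', 'b'), ('bb', 'b'), ('pp', 'p')]

# Keep only the first element of each tuple, as the answer does with i[0].
pairs = [whole for whole, letter in matches]
print(pairs)     # ['bb', 'bb', 'pp']
```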

A Pythonic alternative is to use the zip function within a list comprehension:
>>> s = 'abbbcppq'
>>>
>>> [i+j for i,j in zip(s,s[1:]) if i==j]
['bb', 'bb', 'pp']
If you are dealing with a large string, you can use iter() to convert the string to an iterator and itertools.tee() to create two independent iterators. Calling next() on the second iterator consumes its first item; then pass both iterators to zip (in Python 2.X use itertools.izip(), which returns an iterator).
>>> from itertools import tee
>>> first = iter(s)
>>> second, first = tee(first)
>>> next(second)
'a'
>>> [i+j for i,j in zip(first,second) if i==j]
['bb', 'bb', 'pp']
Benchmark with RegEx recipe:
# ZIP
~ $ python -m timeit --setup "s='abbbcppq'" "[i+j for i,j in zip(s,s[1:]) if i==j]"
1000000 loops, best of 3: 1.56 usec per loop
# REGEX
~ $ python -m timeit --setup "s='abbbcppq';import re" "[i[0] for i in re.findall(r'(([a-z])\2)', 'abbbbcppq')]"
100000 loops, best of 3: 3.21 usec per loop
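Note that the two commands above do not time the same input: the regex statement uses the literal 'abbbbcppq' rather than s. A sketch of a like-for-like comparison with the timeit module follows. The helper names with_zip and with_regex are made up for illustration, and absolute timings will vary by machine. Keep in mind that the two approaches also return different results on runs of three letters, because zip reports overlapping pairs and the regex does not.

```python
import re
import timeit

s = 'abbbcppq'

def with_zip():
    # Compares every character with its successor, so pairs may overlap.
    return [i + j for i, j in zip(s, s[1:]) if i == j]

def with_regex():
    # re.findall scans for non-overlapping matches only.
    return [m[0] for m in re.findall(r'(([a-z])\2)', s)]

print(with_zip())    # ['bb', 'bb', 'pp']
print(with_regex())  # ['bb', 'pp']

# Time both on the same string; the loop count is kept modest so this runs quickly.
zip_time = timeit.timeit(with_zip, number=100000)
regex_time = timeit.timeit(with_regex, number=100000)
print('zip:   %.3f s' % zip_time)
print('regex: %.3f s' % regex_time)
```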
After your last edit, as mentioned in the comments: if you want to match only one pair of b in strings like "abbbcppq", you can use finditer(), which returns an iterator of match objects, and extract the result with the group() method:
>>> import re
>>>
>>> s = "abbbcppq"
>>> [item.group(0) for item in re.finditer(r'([a-z])\1',s,re.I)]
['bb', 'pp']
Note that re.I is the IGNORECASE flag, which makes the RegEx match uppercase letters too.
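For contrast, if overlapping pairs are wanted from a regex (the same result the zip approach gives), one option is to wrap the pattern in a lookahead so that each match consumes no characters. This is a sketch of that idea, not part of the original answer:

```python
import re

s = "abbbcppq"

# Plain pattern: each match consumes two characters, so 'bbb' yields one pair.
non_overlapping = [m.group(0) for m in re.finditer(r'([a-z])\1', s)]

# Lookahead: the match itself is empty, so the scan advances one character
# at a time and 'bbb' yields two pairs. Group 1 holds the pair.
overlapping = [m.group(1) for m in re.finditer(r'(?=(([a-z])\2))', s)]

print(non_overlapping)  # ['bb', 'pp']
print(overlapping)      # ['bb', 'bb', 'pp']
```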

Using a backreference, it is very easy:
import re
# r'' works on both Python 2 and 3 (the ur'' prefix is Python 2 only)
p = re.compile(r'([a-z])\1{1,}')
re.findall(p, u"ugknbfddgicrmopn")
#output: [u'd'] (['d'] on Python 3)
re.findall(p,"abbbcppq")
#output: ['b', 'p']
For more details, you can refer to a similar question about Perl: Regular expression to match any character being repeated more than 10 times
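Because findall returns only the captured group, the pattern above reports the repeated letter but not how long the run was. A small sketch, assuming you want the whole run, is to read group(0) from finditer instead (\1+ is equivalent to \1{1,}):

```python
import re

# A letter followed by one or more copies of itself.
pattern = re.compile(r'([a-z])\1+')

runs = [m.group(0) for m in pattern.finditer('abbbcppq')]
print(runs)  # ['bbb', 'pp']

# Keep only the letters that occur exactly twice in a row.
exactly_twice = [run for run in runs if len(run) == 2]
print(exactly_twice)  # ['pp']
```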

It is pretty easy without regular expressions:
In [3]: import collections
In [4]: [k for k, v in collections.Counter("abracadabra").items() if v==2]
Out[4]: ['b', 'r']
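Be aware that Counter counts occurrences anywhere in the string, not adjacent ones, so it answers a slightly different question than the regex answers. The sketch below (function names and the 'abab' sample are made up for illustration) shows the difference:

```python
import re
from collections import Counter

def twice_anywhere(text):
    # Letters whose total count in the string is exactly 2.
    return [k for k, v in Counter(text).items() if v == 2]

def twice_adjacent(text):
    # Letters that appear as a doubled, side-by-side pair.
    return [m.group(1) for m in re.finditer(r'([a-z])\1', text)]

print(sorted(twice_anywhere('abab')))      # ['a', 'b']
print(twice_adjacent('abab'))              # []
print(twice_adjacent('ugknbfddgicrmopn'))  # ['d']
```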

Maybe you can use a generator to achieve this:
def adj(s):
    last_c = None
    for c in s:
        if c == last_c:
            yield c * 2
        last_c = c
s = 'ugknbfddgicrmopn'
v = [x for x in adj(s)]
print(v)
# output: ['dd']

"or maybe there's some better ways"
Since regex is often misunderstood by the next developer to encounter your code (who may even be you),
and since simpler != shorter,
how about the following pseudo-code:
function findMultipleLetters(inputString) {
    foreach (letter in inputString) {
        dictionaryOfLettersOccurrance[letter]++;
        if (dictionaryOfLettersOccurrance[letter] == 2) {
            multipleLetters.add(letter);
        }
    }
    return multipleLetters;
}
multipleLetters = findMultipleLetters("ugknbfddgicrmopn");
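A direct Python translation of the pseudo-code might look like the following sketch. The function name is adapted to Python naming, and the two containers the pseudo-code leaves implicit are created explicitly. Note that, as written, it reports letters seen twice anywhere in the string, not only adjacent ones:

```python
def find_multiple_letters(input_string):
    occurrences = {}       # letter -> how many times it has been seen so far
    multiple_letters = []  # letters in the order they reached a count of 2
    for letter in input_string:
        occurrences[letter] = occurrences.get(letter, 0) + 1
        if occurrences[letter] == 2:
            multiple_letters.append(letter)
    return multiple_letters

print(find_multiple_letters("ugknbfddgicrmopn"))  # ['d', 'g', 'n']
```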

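A1 = "abcdededdssffffccfxx"
for i in range(len(A1)-1):
    if A1[i+1] == A1[i]:
        # report only the first pair of each run; i == 0 is checked explicitly
        # so that A1[i-1] does not wrap around to the last character
        if i == 0 or A1[i] != A1[i-1]:
            print(A1[i] * 2)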

>>> l = ['ug', 'kn', 'bf', 'dd', 'gi', 'cr', 'mo', 'pn']
>>> import re
>>> newList = [item for item in l if re.search(r"([a-z]{1})\1", item)]
>>> newList
['dd']

Related

Python: Check for unique characters on a String

I'm asking the user to input a keyword and then remove any duplicate characters.
Example:
Input: balloon
Output: balon
I've tried this solution: List of all unique characters in a string? but it marks it as a syntax error.
Any ideas?
Try:
In [4]: from collections import OrderedDict
In [5]: def rawInputTest():
   ...:     x = raw_input(">>> Input: ")
   ...:     print ''.join(OrderedDict.fromkeys(x).keys())
In [6]: rawInputTest()
>>> Input: balloon
balon
For your problem, order is important. Here is a one-line solution:
word = raw_input()
reduced_word = ''.join(
    [char for index, char in enumerate(word) if char not in word[0:index]])
You can use OrderedDict:
In [15]: from collections import OrderedDict
In [16]: s="balloon"
In [17]: od = OrderedDict.fromkeys(s).keys()
In [18]: print(od)
['b', 'a', 'l', 'o', 'n']
In [19]: "".join(od)
Out[19]: 'balon'
You are performing a set operation. If order is not important, just insert each character into a set and print the contents of the set when done.
If the order is important, then create a set and a list. For each character, if it is not in the set, append it to the list and insert it into the set. Print the contents of the list in order when done.
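The order-preserving variant described above can be sketched as follows (unique_chars is a name chosen here for illustration):

```python
def unique_chars(word):
    seen = set()   # characters already emitted, for fast membership tests
    result = []    # characters in first-seen order
    for ch in word:
        if ch not in seen:
            seen.add(ch)
            result.append(ch)
    return ''.join(result)

print(unique_chars('balloon'))  # balon
```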
dict1 = {"cat":6,"rat":7,"dog":4,"elephent":20,"lion":15,"cow":9,"bat":4,"hen":2,"bird":2}
dic_values= dict1.values()
unique_val = set(dic_values)
print("dic_val :",dic_values)
print("unique_dict_valu :",unique_val)

Splitting strings in Python, then joining them so that each string is one substring longer than the previous one

Basically what I want to do is take a string like this:
-o-pp-gg-s-h
then turn it into the series of strings:
-o
-o-pp
-o-pp-gg
-o-pp-gg-s
-o-pp-gg-s-h
I know that I could do this by splitting the string (str.split('-')), then having a loop that joins the substrings to produce that output ('-'.join(lst)). However, is there a more elegant way to do this in Python?
List comprehension!
s = "-o-pp-gg-s-h"
ss = s.split("-")
series = ["-".join(ss[:x]) for x in range(2,len(ss)+1)]
Is this elegant enough?
>>> s='-o-pp-gg-s-h'
>>> nlist=s.split('-')
>>> for i in range(2, len(nlist)+1):
...     print('-'.join(nlist[:i]))
...
-o
-o-pp
-o-pp-gg
-o-pp-gg-s
-o-pp-gg-s-h
>>> a
'-o-pp-gg-s-h'
>>> b=a.split('-')
>>> b
['', 'o', 'pp', 'gg', 's', 'h']
>>> for i in range(2, len(b)+1):
...     print('-'.join(b[0:i]))
...
-o
-o-pp
-o-pp-gg
-o-pp-gg-s
-o-pp-gg-s-h
Only for variety, using accumulate (in modern Python):
In [23]: s = '-o-pp-gg-s-h'
In [24]: from itertools import accumulate
In [25]: list(accumulate(s.split('-'), lambda x,y: x+'-'+y))[1:]
Out[25]: ['-o', '-o-pp', '-o-pp-gg', '-o-pp-gg-s', '-o-pp-gg-s-h']
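Another option, sketched here under the assumption of a single-character separator, avoids splitting and re-joining altogether: find where each separator sits and slice the original string up to that point. The name growing_prefixes is made up for this example.

```python
def growing_prefixes(s, sep='-'):
    # Cut just before every separator except one at index 0, and at the end.
    cuts = [i for i, ch in enumerate(s) if ch == sep and i > 0] + [len(s)]
    return [s[:cut] for cut in cuts]

print(growing_prefixes('-o-pp-gg-s-h'))
# ['-o', '-o-pp', '-o-pp-gg', '-o-pp-gg-s', '-o-pp-gg-s-h']
```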

modifying part of a list in place using list comprehensions in python

I have a list that looks like
test = ['A','B','C','D D','E E','F F']
I would like test to become the following (that is, the spaces removed)
test = ['A', 'B', 'C', 'DD', 'EE', 'FF']
I used a list comprehension in Python to achieve this:
>>> [re.sub(' ','',i) for i in test]
['A', 'B', 'C', 'DD', 'EE', 'FF']
My question is - what if I explicitly DO NOT want re.sub(' ','',i) to run on the first three elements of my list? I only want the re.sub function to run on 'DD','EE', and 'FF'.
Is this way efficient? I understand a list comprehension takes up memory because Python makes a copy.
test[3:] = [re.sub(' ','',i) for i in test[3:]]
Or should I just loop through the values of test that I want to modify like this:
for i in range(3,len(test)):
    print i
    test[i] = re.sub(' ','',test[i])
First of all, it sounds like you're optimizing prematurely.
Secondly, you can express your requirements with a single list comprehension:
In [5]: test = ['A','B','C','D D','E E','F F']
In [6]: [t if i < 3 else re.sub(' ', '', t) for (i, t) in enumerate(test)]
Out[6]: ['A', 'B', 'C', 'DD', 'EE', 'FF']
Finally, my advice would be to focus on correctness first, then on readability. Once you've achieved those, profile the code to see where the bottlenecks are, and only then optimize for performance.
Of re.sub, str.replace and str.translate, str.replace is the fastest here, so use str.replace.
Here is a little timing comparison.
import re
def test1():
    test = ['A','B','C','D D','E E','F F']
    test[3:] = [re.sub(' ','',i) for i in test[3:]]
def test2():
    test = ['A','B','C','D D','E E','F F']
    test[3:] = [i.replace(" ", "") for i in test[3:]]
def test3():
    test = ['A','B','C','D D','E E','F F']
    test[3:] = [item.translate(None, " ") for item in test[3:]]
from timeit import timeit
print timeit("test1()", "from __main__ import test1")
print timeit("test2()", "from __main__ import test2")
print timeit("test3()", "from __main__ import test3")
Output on my machine
3.96201109886
0.985305070877
1.11600804329
Note: As @roippi mentioned in the comments, str.translate will not work in this form in Python 3.x, so leave it out of the comparison if you are using Python 3.x.
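For reference, a Python 3 form of the translate variant could look like this sketch. It uses str.maketrans to build the deletion table; it is not part of the timing above and no speed claim is made for it.

```python
test = ['A', 'B', 'C', 'D D', 'E E', 'F F']

# str.maketrans('', '', ' ') builds a table that maps the space character
# to None, which tells translate() to delete it.
remove_spaces = str.maketrans('', '', ' ')
test[3:] = [item.translate(remove_spaces) for item in test[3:]]

print(test)  # ['A', 'B', 'C', 'DD', 'EE', 'FF']
```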
My question is - what if I explicitly DO NOT want re.sub(' ','',i) to
run on the first three elements of my list?
Okay, answering that question first:
You can use enumerate and a conditional expression to specify the behavior you want for i < 3 and i >= 3:
[x if i<3 else re.sub(' ','',x) for i,x in enumerate(test)]
['A', 'B', 'C', 'DD', 'EE', 'FF']
Note that this simple sub operation can be handled more straightforwardly by str.replace.
(I will leave out discussion of whether this sort of optimization is worthwhile, other than saying the time saved by not doing re.sub on the first three elements is minuscule.)
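On the memory part of the question: slice assignment does build a temporary list for the right-hand side, but it then updates the existing list object rather than rebinding the name. A quick sketch to check this (the alias variable is only there for the demonstration):

```python
test = ['A', 'B', 'C', 'D D', 'E E', 'F F']
alias = test  # second reference to the same list object

# The comprehension creates a temporary 3-element list; the slice assignment
# then writes those elements into the original list.
test[3:] = [s.replace(' ', '') for s in test[3:]]

print(alias)          # ['A', 'B', 'C', 'DD', 'EE', 'FF']
print(alias is test)  # True
```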

Multiple-replace in python

I do the following for replacing.
import fileinput
for line in fileinput.FileInput("input.txt",inplace=1):
    line = line.replace("A","A'")
    print line,
But I want to do many replacements.
Replace A with A' , B with BB, C with CX, D with KK, etc.
I can of course do this by repeating the above code many times.
But I guess that will consume a lot of time especially when input.txt is large.
How can I do this elegantly?
Emphasis added
My input is not just a str ABCD.
I need to use input.txt as input, and I want to replace every occurrence of A in input.txt with A', every occurrence of B with BB, every occurrence of C with CX, and every occurrence of D with KK.
Use a mapping dictionary:
>>> map_dict = {'A':"A'", 'B':'BB', 'C':'CX', 'D':'KK'}
>>> strs = 'ABCDEF'
>>> ''.join(map_dict.get(c,c) for c in strs)
"A'BBCXKKEF"
In Python 3, use str.translate instead of str.join:
>>> map_dict = {ord('A'):"A'", ord('B'):'BB', ord('C'):'CX', ord('D'):'KK'}
>>> strs = 'ABCDEF'
>>> strs.translate(map_dict)
"A'BBCXKKEF"
Using regular expression:
>>> import re
>>>
>>> replace_map = {
... 'A': "A'",
... 'B': 'BB',
... 'C': 'CX',
... 'D': 'KK',
... 'EFG': '.',
... }
>>> pattern = '|'.join(map(re.escape, replace_map))
>>> re.sub(pattern, lambda m: replace_map[m.group()], 'ABCDEFG')
"A'BBCXKK."

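The regular-expression approach can be wrapped in a helper and applied line by line to the file's contents. The sketch below is one way to do that. multi_replace is a hypothetical name, the list of lines stands in for input.txt, and sorting the keys longest-first is an added precaution for the case where one key is a prefix of another (the alternation otherwise takes the first alternative that matches):

```python
import re

def multi_replace(text, replace_map):
    # Longest keys first so that e.g. 'EFG' is tried before 'E'.
    keys = sorted(replace_map, key=len, reverse=True)
    pattern = re.compile('|'.join(re.escape(k) for k in keys))
    # Every match is looked up in the map in a single pass over the text,
    # so replacement output is never replaced again.
    return pattern.sub(lambda m: replace_map[m.group(0)], text)

replace_map = {'A': "A'", 'B': 'BB', 'C': 'CX', 'D': 'KK'}
lines = ['ABCD\n', 'xAyB\n']   # stand-in for the lines of input.txt
print([multi_replace(line, replace_map) for line in lines])
# ["A'BBCXKK\n", "xA'yBB\n"]
```
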
How to convert a string with comma-delimited items to a list in Python?

How do you convert a string into a list?
Say the string is like text = "a,b,c". After the conversion, text == ['a', 'b', 'c'] and hopefully text[0] == 'a', text[1] == 'b'?
Like this:
>>> text = 'a,b,c'
>>> text = text.split(',')
>>> text
['a', 'b', 'c']
Just to add on to the existing answers: hopefully, you'll encounter something more like this in the future:
>>> word = 'abc'
>>> L = list(word)
>>> L
['a', 'b', 'c']
>>> ''.join(L)
'abc'
But for what you're dealing with right now, go with @Cameron's answer.
>>> word = 'a,b,c'
>>> L = word.split(',')
>>> L
['a', 'b', 'c']
>>> ','.join(L)
'a,b,c'
The following Python code will turn your string into a list of strings:
import ast
teststr = "['aaa','bbb','ccc']"
testarray = ast.literal_eval(teststr)
I don't think you need to
In Python you seldom need to convert a string to a list, because strings and lists are very similar.
Changing the type
If you really have a string which should be a character array, do this:
In [1]: x = "foobar"
In [2]: list(x)
Out[2]: ['f', 'o', 'o', 'b', 'a', 'r']
Not changing the type
Note that strings are very much like lists in Python.
Strings have accessors, like lists
In [3]: x[0]
Out[3]: 'f'
Strings are iterable, like lists
In [4]: for i in range(len(x)):
   ...:     print x[i]
   ...:
f
o
o
b
a
r
TLDR
Strings are lists. Almost.
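The "almost" matters when you need to change a character: strings are immutable, so item assignment fails, and that is the case where converting to a list is actually useful. A minimal sketch:

```python
x = "foobar"

try:
    x[0] = "F"           # strings do not support item assignment
except TypeError as exc:
    print("TypeError:", exc)

chars = list(x)          # a list of characters can be modified
chars[0] = "F"
result = "".join(chars)
print(result)            # Foobar
```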
In case you want to split by spaces, you can just use .split():
a = 'mary had a little lamb'
z = a.split()
print z
Output:
['mary', 'had', 'a', 'little', 'lamb']
If you actually want arrays:
>>> from array import array
>>> text = "a,b,c"
>>> text = text.replace(',', '')
>>> myarray = array('c', text)
>>> myarray
array('c', 'abc')
>>> myarray[0]
'a'
>>> myarray[1]
'b'
If you do not need arrays and only want to look up characters by index, remember that a string is a sequence, just like a list, except that it is immutable:
>>> text = "a,b,c"
>>> text = text.replace(',', '')
>>> text[0]
'a'
m = '[[1,2,3],[4,5,6],[7,8,9]]'
m = eval(m.split()[0])
# m is now [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
All answers are good; another way of doing it is a list comprehension, see the solution below.
u = "UUUDDD"
lst = [x for x in u]
For a comma-separated list, do the following:
u = "U,U,U,D,D,D"
lst = [x for x in u.split(',')]
I usually use:
l = [ word.strip() for word in text.split(',') ]
strip() removes the whitespace around each word.
To convert a string of the form a="[[1, 3], [2, -6]]", I wrote this not-yet-optimized code:
matrixAr = []
mystring = "[[1, 3], [2, -4], [19, -15]]"
b = mystring.replace("[[","").replace("]]","")  # remove the leading [[ and trailing ]]
for line in b.split('], ['):
    row = list(map(int, line.split(',')))  # int() tolerates the spaces around each number
    matrixAr.append(row)
print(matrixAr)
split() is your friend here. I will cover a few aspects of split() that are not covered by other answers.
If no arguments are passed to split(), it splits the string on whitespace characters (space, tab, and newline). Leading and trailing whitespace is ignored, and consecutive whitespace characters are treated as a single delimiter.
Example:
>>> " \t\t\none two three\t\t\tfour\nfive\n\n".split()
['one', 'two', 'three', 'four', 'five']
When a single character delimiter is passed, split() behaves quite differently from its default behavior. In this case, leading/trailing delimiters are not ignored, repeating delimiters are not "coalesced" into one either.
Example:
>>> ",,one,two,three,,\n four\tfive".split(',')
['', '', 'one', 'two', 'three', '', '\n four\tfive']
So, if stripping of whitespaces is desired while splitting a string based on a non-whitespace delimiter, use this construct:
words = [item.strip() for item in string.split(',')]
When a multi-character string is passed as the delimiter, it is taken as a single delimiter and not as a character class or a set of delimiters.
Example:
>>> "one,two,three,,four".split(',,')
['one,two,three', 'four']
To coalesce multiple delimiters into one, you would need to use re.split(regex, string) approach. See the related posts below.
Related
string.split() - Python documentation
re.split() - Python documentation
Split string based on regex
Split string based on a regular expression
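As a sketch of the re.split approach mentioned above, the helper below treats any run of commas, together with the whitespace around it, as one delimiter. The function name and the choice to drop empty leading and trailing items are assumptions made for this example.

```python
import re

def split_on_commas(text):
    # One or more commas, plus any whitespace around them, count as one delimiter.
    parts = re.split(r'\s*,+\s*', text.strip())
    # Leading or trailing delimiters leave empty strings behind; drop them.
    return [part for part in parts if part]

print(split_on_commas("one,two,three,,four"))   # ['one', 'two', 'three', 'four']
print(split_on_commas(",,one, two ,,three,"))   # ['one', 'two', 'three']
```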
# to strip `,` and `.` from a string ->
>>> 'a,b,c.'.translate(None, ',.')
'abc'
You should use the built-in translate method for strings.
Type help('abc'.translate) at Python shell for more info.
Using functional Python:
text=filter(lambda x:x!=',',map(str,text))
Example 1
>>> email= "myemailid@gmail.com"
>>> email.split()
#OUTPUT
["myemailid@gmail.com"]
Example 2
>>> email= "myemailid@gmail.com, someonsemailid@gmail.com"
>>> email.split(',')
#OUTPUT
["myemailid@gmail.com", " someonsemailid@gmail.com"]
