Find all indices of each letter in a string - python

I'm trying to get a list consisting of the indexes for each item of another sequence.
Sounds easy enough in theory.
a = 'string of letters'
b = [a.index(x) for x in a]
But it doesn't work. I've tried list comprehensions, simple for loops, using enumerate etc, but every time b will return the same index for duplicates in a.
That is, 's' in a, for example, will return '0' in b for both the first and last item because they're the same character.
I'm guessing it's a cache or something like that, as a way for Python to speed things up.
In any case, I can't figure this out and I'd appreciate some help as to how I can get this working as well as maybe an explanation of why this happens.

Thanks a lot for the input. I did figure it out with enumerate, actually.
To elaborate, I had two lists, a and b. a contains both uppercase and lowercase characters. b consists of the same characters as a, but shifted by a certain number of positions, like in a cipher.
I wanted to keep the case of the characters in b at the same position after the 'encoding', but for that I needed the index of each character in a.
Anyway, it was as simple as this:
a = 'tEXt'
c = [x for x,y in enumerate(a) if y.isupper()]
b = ['x', 't', 't', 'e']  # the encoded version of a, returned from elsewhere as a string and converted to a list here
for x in c:
    b[x] = b[x].upper()
b = ''.join(b)
b
'xTTe'

.index just returns the first occurrence of a character in a string - this has nothing to do with caches. It seems like you just want the list of numbers from 0 until your string length-1:
b = list(range(len(a)))
You do not mention why you need this, but it's pretty rare to need something like this in Python. Note that in Python 3 range returns a special type of its own representing an immutable sequence of numbers, so you need to explicitly convert it to a list if you actually need one.
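To make the difference concrete, here is a minimal sketch (the names first_hits and positions are illustrative, not from the question) comparing str.index, which searches from the left on every call, with enumerate, which reports each character's own position:

```python
a = 'string of letters'

# str.index always scans from the left, so duplicates map to the first hit
first_hits = [a.index(ch) for ch in a]

# enumerate pairs every character with its actual position
positions = [i for i, ch in enumerate(a)]

# the final 's' sits at position 16, but index() reports the first 's' at 0
print(first_hits[16], positions[16])  # 0 16
```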

I refactored the code you posted as an answer, let me know if I understood things correctly.
from typing import List
def copy_case(a: str, b: str) -> str:
    res_chars: List[str] = []
    curr_a: str
    curr_b: str
    for curr_a, curr_b in zip(a, b):
        if curr_a.isupper():
            curr_b = curr_b.upper()
        else:
            curr_b = curr_b.lower()
        res_chars.append(curr_b)
    return ''.join(res_chars)
print(copy_case('tEXt', 'xTTe'))

One approach could be to build a dictionary, iterating over the distinct letters in the string and using re.finditer to obtain the index of all occurrences in the string. So going step by step:
import re
a = 'string of letters'
We can find the unique letters in the string by taking a set:
letters = set(a.replace(' ',''))
# {'e', 'f', 'g', 'i', 'l', 'n', 'o', 'r', 's', 't'}
Then we could use a dictionary comprehension to build the dictionary, in which each value is a list generated by iterating over all match instances returned by re.finditer:
{w: [m.start() for m in re.finditer(w, a)] for w in letters}
{'i': [3],
'o': [7],
'f': [8],
'l': [10],
'g': [5],
'e': [11, 14],
't': [1, 12, 13],
's': [0, 16],
'n': [4],
'r': [2, 15]}
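One caveat with the regex approach: each character is used as a pattern, so a character such as '.' would match everything and '+' would raise an error. A hedged sketch of a variant that guards against this with re.escape (the function name index_map is made up for illustration):

```python
import re

def index_map(text):
    # re.escape is an addition here: it keeps characters like '.' or '+' literal
    return {
        ch: [m.start() for m in re.finditer(re.escape(ch), text)]
        for ch in set(text)
    }

print(index_map('a.b.a'))
```

For the input 'a.b.a' this maps 'a' to [0, 4], '.' to [1, 3] and 'b' to [2]; the key order may vary because the keys come from a set.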

A dict is probably better than a list for this purpose:
foo = {x: [] for x in a}  # creates a dict whose keys are the unique values in a
for i, x in enumerate(a):
    foo[x].append(i)  # adds each index to the list for its character
For example, for the string 'abababababa':
{'a': [0, 2, 4, 6, 8, 10], 'b': [1, 3, 5, 7, 9]}
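The same idea can be written without pre-building the keys by using collections.defaultdict. This is only a sketch; the function name indices_by_char is an assumption for illustration:

```python
from collections import defaultdict

def indices_by_char(text):
    # missing keys start out as an empty list automatically
    positions = defaultdict(list)
    for i, ch in enumerate(text):
        positions[ch].append(i)
    return dict(positions)  # convert back to a plain dict for display

print(indices_by_char('hello'))
# {'h': [0], 'e': [1], 'l': [2, 3], 'o': [4]}
```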

Sounds like you're trying to get a list of the indices of each input char as
an output. So, for s, you would get [0, 16], or something along those lines.
So for each input char, you would add its position to the right list.
Storing the results in a dict seems like a good approach, so, something like:
def index_dict(stringy):
    d = {}
    for index, char in enumerate(stringy):
        if char not in d:
            d[char] = []
        d[char].append(index)
    return d
The index() method always finds the first occurrence. You need to find all occurrences. So, the above func will give you a dict whose keys are the chars of your input string, and the value for each key is a list of the indices where that char is found.

Related

IndexError: list index out of range due to empty list within list

I have a list of groups (each group is itself a list), and I'm trying to create a dictionary from each group by assigning its elements as values to fixed keys. However, one of these groups is effectively empty: it contains only a single-space string. I tried to use the filter() function to eliminate that group and I also tried the remove() method, but neither worked. I get the following error:
my_dict = {'letter': g[0], 'my_arr': g[1], 'second_letter_conf': g[2]}
IndexError: list index out of range
This is what I have tried:
import numpy as np
my_list = [['A', np.array([4, 2, 1, 6]), [['B', 5]]], [' '], ['C', np.array([8, 5, 5, 9]), [['D', 3]]]]
# my_list = list(filter(None, my_list)) # does not work
for g in my_list:
    # if g == [' ']:
    #     my_list.remove(g) # does not work
    my_dict = {'letter': g[0], 'my_arr': g[1], 'second_letter_conf': g[2]}
Where am I going wrong? How do I eliminate that one empty list from my_list?
You can't safely mutate a list while you iterate over it. But you can just skip the element and move to the next; you were quite close really:
for g in my_list:
    # I might suggest `if len(g) != 3:` instead, if any length-3 thing
    # is good and other lengths should be discarded
    if g == [' ']:
        continue  # Skip this element, go to next
    my_dict = {'letter': g[0], 'my_arr': g[1], 'second_letter_conf': g[2]}
If you need to exclude it from my_list (you'll be using it over and over and don't want to have it there in the future), a quick pre-filter with a listcomp is the simplest solution (and if it lets you assume length three data for what remains, it will allow you to unpack for cleaner/more-self-documenting code in subsequent use):
my_list = [x for x in my_list if x != [' ']]
for let, arr, let2_conf in my_list:
    my_dict = {'letter': let, 'my_arr': arr, 'second_letter_conf': let2_conf}
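As for why filter(None, my_list) did not help: it only drops falsy items, and [' '] is a non-empty list, so it is truthy. A small sketch of this, using plain lists as stand-ins for the NumPy arrays in the question (the names groups, complete and records are illustrative):

```python
# Plain lists stand in for the NumPy arrays from the question
groups = [['A', [4, 2, 1, 6], [['B', 5]]], [' '], ['C', [8, 5, 5, 9], [['D', 3]]]]

# [' '] has one element, so it is truthy and filter(None, ...) keeps it
print(bool([' ']))  # True

# Keeping only complete three-item groups drops the short one
complete = [g for g in groups if len(g) == 3]
records = [
    {'letter': let, 'my_arr': arr, 'second_letter_conf': conf}
    for let, arr, conf in complete
]
print(len(records))  # 2
```

Collecting the dictionaries in a list also avoids overwriting my_dict on every pass, if all of them are needed later.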

python list comprehension with cls

I encountered a snippet of code like the following:
array = ['a', 'b', 'c']
ids = [array.index(cls.lower()) for cls in array]
I'm confused about two points:
What does [... for cls in array] mean? Since cls is a reserved keyword for class, why not just use [... for s in array]?
Why bother writing something complicated like this instead of just [i for i in range(len(array))]?
I believe this code is written by someone more experienced with python than me, and I believe he must have some reason for doing so...
cls is not a reserved word for class. That would be a very poor choice of name by the language designer. Many programmers may use it by convention but it is no more reserved than the parameter name self.
If you use distinct upper and lower case characters in the list, you will see the difference:
array = ['a', 'b', 'c', 'B','A','c']
ids = [array.index(cls.lower()) for cls in array]
print(ids)
[0, 1, 2, 1, 0, 2]
The value at position 3 is 1 instead of 3 because the first occurrence of a lowercase 'b' is at index 1. Similarly, the value at the last position is 2 instead of 5 because the first 'c' is at index 2.
This list comprehension requires that the array always contain a lowercase instance of every uppercase letter. For example, ['a', 'B', 'c'] would raise a ValueError, because 'b' is not in the list. Hopefully there are other safeguards in the rest of the program to ensure that this requirement is always met.
A safer, and more efficient way to write this would be to build a dictionary of character positions before going through the array to get indexes. This would make the time complexity O(n) instead of O(n^2). It could also help make the process more robust.
array = ['a', 'b', 'c', 'B','A','c','Z']
firstchar = {c:-i for i,c in enumerate(array[::-1],1-len(array))}
ids = [firstchar.get(c.lower()) for c in array]
print(ids)
[0, 1, 2, 1, 0, 2, None]
The firstchar dictionary contains the first index in array containing a given letter. It is built by going backward through the array so that the smallest index remains when there are multiple occurrences of the same letter.
{'Z': 6, 'c': 2, 'A': 4, 'B': 3, 'b': 1, 'a': 0}
Then, going through the array to form ids, each character finds the corresponding index in O(1) time by using the dictionary.
Using the .get() method allows the list comprehension to survive an upper case letter without a corresponding lowercase value in the list. In this example it returns None but it could also be made to return the letter's index or the index of the first uppercase instance.
Some developers might be experienced, but actually terrible with the code they write and just "skate on by".
Having said that, your suggested code for question #2 would produce different output if the list contained duplicates. The original code returns the first index at which each element occurs, whereas yours gives each item's own index. It would also differ if the array elements weren't all lowercase.
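If the reversed enumerate trick feels hard to read, the first-occurrence dictionary can also be built in a forward pass with dict.setdefault, which only stores a key the first time it is seen. A sketch under that assumption (first_index_ids is a made-up name):

```python
def first_index_ids(items):
    first_seen = {}
    for i, item in enumerate(items):
        # setdefault leaves an existing entry alone, so the first index wins
        first_seen.setdefault(item, i)
    # .get returns None when no lowercase counterpart exists
    return [first_seen.get(item.lower()) for item in items]

print(first_index_ids(['a', 'b', 'c', 'B', 'A', 'c', 'Z']))
# [0, 1, 2, 1, 0, 2, None]
```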

Permutations using a multidict

I'm trying to put together a code that replaces unique characters in a given input string with corresponding values in a dictionary in a combinatorial manner while preserving the position of 'non' unique characters.
For example, I have the following dictionary:
d = {'R':['A','G'], 'Y':['C','T']}
How would I go about replacing all instances of 'R' and 'Y' while producing all possible combinations of the string but maintaining the positions of 'A' and 'C'?
For instance, the input 'ARCY' would generate the following output:
'AACC'
'AGCC'
'AACT'
'AGCT'
Hopefully that makes sense. If anyone can point me in the right directions, that would be great!
Given the dictionary, we can state a rule that tells us what letters are possible at a given position in the output. If the original letter from the input is in the dictionary, we use the value; otherwise, there is a single possibility - the original letter itself. We can express that very neatly:
def candidates(letter):
    d = {'R':['A','G'], 'Y':['C','T']}
    return d.get(letter, [letter])
Knowing the candidates for each letter (which we can get by mapping our candidates function onto the letters in the pattern), we can create the Cartesian product of candidates, and collapse each result (which is a tuple of single-letter strings) into a single string by simply ''.joining them.
import itertools
def substitute(pattern):
    return [
        ''.join(result)
        for result in itertools.product(*map(candidates, pattern))
    ]
Let's test it:
>>> substitute('ARCY')
['AACC', 'AACT', 'AGCC', 'AGCT']
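If the mapping should not be hard-coded inside candidates, the two helpers can be folded into one function that takes the dictionary as an argument. This is just a sketch of that variation; substitute_with and its mapping parameter are names introduced here, not part of the answer above:

```python
import itertools

def substitute_with(pattern, mapping):
    # one pool of choices per position; unmapped letters keep a single choice
    pools = [mapping.get(letter, [letter]) for letter in pattern]
    return [''.join(combo) for combo in itertools.product(*pools)]

d = {'R': ['A', 'G'], 'Y': ['C', 'T']}
print(substitute_with('ARCY', d))
# ['AACC', 'AACT', 'AGCC', 'AGCT']
```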
The following generator function produces all of your desired strings, using enumerate, zip, itertools.product, a list comprehension, and argument list unpacking, all of which are very handy Python tools/concepts you should read up on:
from itertools import product
def multi_replace(s, d):
    indexes, replacements = zip(*[(i, d[c]) for i, c in enumerate(s) if c in d])
    # indexes: (1, 3)
    # replacements: (['A', 'G'], ['C', 'T'])
    l = list(s)  # turn s into something mutable
    # iterate over the Cartesian product of all replacement lists ...
    for p in product(*replacements):
        for index, replacement in zip(indexes, p):
            l[index] = replacement
        yield ''.join(l)
d = {'R': ['A', 'G'], 'Y': ['C', 'T']}
s = 'ARCY'
for perm in multi_replace(s, d):
    print(perm)
AACC
AACT
AGCC
AGCT
s = 'RRY'
AAC
AAT
AGC
AGT
GAC
GAT
GGC
GGT
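One edge case worth noting: if the input contains no replaceable characters, the list passed to zip(*...) is empty and the two-name unpacking fails with a ValueError. A possible guarded variant, offered as a sketch (multi_replace_safe is a hypothetical name):

```python
from itertools import product

def multi_replace_safe(s, d):
    pairs = [(i, d[c]) for i, c in enumerate(s) if c in d]
    if not pairs:
        # nothing to substitute: the only "combination" is the input itself
        yield s
        return
    indexes, replacements = zip(*pairs)
    chars = list(s)
    for combo in product(*replacements):
        for index, replacement in zip(indexes, combo):
            chars[index] = replacement
        yield ''.join(chars)

d = {'R': ['A', 'G'], 'Y': ['C', 'T']}
print(list(multi_replace_safe('ACGT', d)))  # ['ACGT']
```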
Change ARCY to a list of per-position choices and use the code below:
import itertools as it
choices = [['A'], ['A','G'],['C'],['C','T']]
[''.join(item) for item in it.product(*choices)]
or
import itertools as it
choices = ['A', 'AG','C', 'CT']
[''.join(item) for item in it.product(*choices)]

Python: List Dictionary Comprehension

I have the following code:
letters = 'defghijklmno'
K = {letters[i]:(i*i-1) for i in range(len(letters))}
I understand that I'm iterating over the sequence variable of letters and how the value is calculated, but I'm confused over how the key gets set to the individual characters of the string, especially because I have letters being indexed as my key. Basically, I'm just trying to figure out how Python evaluates this expression.
That dict comprehension is basically a synonym for:
k = {}
for i in range(len(letters)):
    k[letters[i]] = i*i - 1
The difference is that it creates a new scope instead of using the outer scope:
>>> letters = 'defghijklmno'
>>> K = {letters[i]:(i*i-1) for i in range(len(letters))}
>>> i # was defined in an inner scope
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
NameError: name 'i' is not defined
>>> k = {}
>>> for i in range(len(letters)):
...     k[letters[i]] = i*i - 1
...
>>> i # still defined!
11
Explanation:
>>> letters = 'defghijklmno'
>>> list(range(len(letters)))
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
This means, that
>>> [letters[i] for i in range(len(letters))]
['d', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o']
At the same time
>>> [(i*i-1) for i in range(len(letters))]
[-1, 0, 3, 8, 15, 24, 35, 48, 63, 80, 99, 120]
So, your dictionary comprehension builds dict of pairs 'd':-1, 'e':0, 'f':3, ... (etc).
Well, first of all, this is a rather bad way of doing it. Looping by indices is a really bad practice in Python (it's slower, and horrible to read), so the much better way is this:
letters = 'defghijklmno'
K = {letter: (i*i-1) for i, letter in enumerate(letters)}
This is just a simple dictionary comprehension. When we loop over a string, we get the individual characters making it up. We use the enumerate() builtin to give us matching numbers, and then produce a dictionary from the letter to the number squared, minus one.
If you are struggling with the comprehension itself, it's equivalent to a for loop (except faster), and I recommend you watch my video for a complete explanation with examples of dictionary comprehensions alongside its cousins (list/set comprehensions and generator expressions).
To understand it, it helps to look at the individual parts of what happens. A for i in range(len(letters)) loop does not loop over the individual characters of letters, but over the indices of the string. That is because you can access individual characters of a string using their index. So letters[0] refers to the first character, letters[1] to the second, and letters[len(letters)-1] to the last.
So, let’s look at the keys of the dictionary individually:
>>> [letters[i] for i in range(len(letters))]
['d', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o']
So you get all the letters individually in the original order.
Now, let’s look at the values of the dictionary:
>>> [(i*i-1) for i in range(len(letters))]
[-1, 0, 3, 8, 15, 24, 35, 48, 63, 80, 99, 120]
So, now we have both keys and values; all that the dictionary comprehension does now is link those keys to the values—in the order above.
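That "linking" step can be spelled out with zip. The following is only an illustrative sketch (the names keys, values and paired are made up here) showing that pairing the two lists gives the same dictionary as the comprehension:

```python
letters = 'defghijklmno'

keys = [letters[i] for i in range(len(letters))]
values = [i * i - 1 for i in range(len(letters))]

# zip pairs the first key with the first value, the second with the second, ...
paired = dict(zip(keys, values))
K = {letters[i]: (i * i - 1) for i in range(len(letters))}

print(paired == K)  # True
```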
The second line is a dictionary comprehension. It is like a normal list comprehension or generator expression except that it generates key value pairs which are then used to form a dictionary.
The code is roughly equivalent to
letters = 'defghijklmno'
K = {}
for i in range(len(letters)):
    key = letters[i]
    val = (i*i-1)
    K[key] = val
You could rewrite the dict comprehension as a loop like this:
K = {} # empty dict
for i in range(len(letters)): # i goes from 0 to 11
    K[letters[i]] = i*i-1
so in the individual iterations you have
K['d'] = -1
K['e'] = 0
K['f'] = 3
# ...
and so on. The dict comprehension is just a shorter (and, in the opinion of most Python programmers, more elegant) way to write this loop.
For every i from i == 0 to i == 11 (index of the last letter in letters), an entry is added to the resulting dictionary where the key is letters[i] and its associated value is i*i-1. This gives:
K['d'] == -1
K['e'] == 0
K['f'] == 3
and so on.
You're not actually iterating over the letters of letters, per se; rather, you're iterating over the length of letters, by varying i from 0, to 1, to 2, ..., to 11. As you vary i, you create a dictionary entry whose key is the ith letter of letters and whose value is i*i - 1.
In other words, you create a dictionary, each entry of which consists of a letter (key) k from letters, paired with a value equal to k's index squared, minus 1.
You can read the dictionary comprehension in plain English as: the dictionary of all letters (keys) k from letters with index i, paired with the value i*i - 1.

going through a dictionary and printing its values in sequence

def display_hand(hand):
    for letter in hand.keys():
        for j in range(hand[letter]):
            print letter,
This prints something like: b e h q u w x. This is the desired output.
How can I modify this code to get the output only when the function has finished its loops?
Something like below code causes me problems as I can't get rid of dictionary elements like commas and single quotes when printing the output:
def display_hand(hand):
    dispHand = []
    for letter in hand.keys():
        for j in range(hand[letter]):
            ##code##
    print dispHand
UPDATE
John's answer is very elegant, I find. Allow me, however, to expand on Kugel's response:
Kugel's approach answered my question. However, I kept running into an additional issue: the function would always return None as well as the output. Reason: whenever you don't explicitly return a value from a function in Python, None is implicitly returned. I couldn't find a way to explicitly return the hand. With Kugel's approach I got closer, but the hand is still buried in a for loop.
You can do this in one line with a comprehension that has two for clauses:
print ' '.join(letter for letter, count in hand.iteritems() for i in range(count))
Let's break that down piece by piece. I'll use a sample dictionary that has a couple of counts greater than 1, to show the repetition part working.
>>> hand
{'h': 3, 'b': 1, 'e': 2}
Get the letters and counts in a form that we can iterate over.
>>> list(hand.iteritems())
[('h', 3), ('b', 1), ('e', 2)]
Now just the letters.
>>> [letter for letter, count in hand.iteritems()]
['h', 'b', 'e']
Repeat each letter count times.
>>> [letter for letter, count in hand.iteritems() for i in range(count)]
['h', 'h', 'h', 'b', 'e', 'e']
Use str.join to join them into one string.
>>> ' '.join(letter for letter, count in hand.iteritems() for i in range(count))
'h h h b e e'
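The session above is Python 2. For the follow-up about returning the hand instead of printing it, a Python 3 sketch could look like this; hand_to_string is a hypothetical name, and dict.items() replaces iteritems():

```python
def hand_to_string(hand):
    # hand maps letter -> count; repeat each letter count times
    return ' '.join(letter for letter, count in hand.items() for _ in range(count))

hand = {'h': 3, 'b': 1, 'e': 2}
result = hand_to_string(hand)  # the caller decides whether to print
print(result)  # h h h b e e
```

Because the function returns the string rather than printing inside the loops, there is no stray None to deal with. On Python 3.7+ the letters come out in the dictionary's insertion order.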
Your ##code## could perhaps be:
dispHand.append(letter)
Update:
To print your list then:
for item in dispHand:
    print item,
Another option, without a nested loop:
"".join((x+' ') * y for x, y in hand.iteritems()).strip()
Use
" ".join(sequence)
to print a sequence without commas and the enclosing brackets.
If you have integers or other non-string items in the sequence, convert them first:
" ".join(str(x) for x in sequence)
