So, I have a dictionary of terms where each key is a word from a text file, and the value is a list of the next two words in that text file.
def dict(txt, n):
    txt = txt.split()
    output = {}
    for i in range(len(txt)-n+1):
        t = ' '.join(txt[i:i+1])
        p = txt[i+1:i+n]
        output.setdefault(t, 0)
        output[t] = p
    return output
The output is a dictionary of things like:
{'had':['a','little'], 'Mary':['had','a'], 'a': ['little', 'lamb.']}
(Mine is actually much longer, as it is analyzing a long paper.)
My question is, how do I join these terms back together by reading the key, and then printing the values, then reading the last value and then finding a key that matches that value. The goal is ultimately to get a randomized paragraph, provided using a large document.
So far, I have something along the lines of:
if output[t] == text[1]:
return output
print(output.join(' ')
But this isn't returning anything. Suggestions?
Python's join does not work like you expect, perhaps.
You are thinking that you write
collection.join(join_character)
but it is
join_character.join(collection)
You should expect to write code like
' '.join(output)
What exactly you need for output is up to you; I expect you can figure that part out. It just looks like you were using join incorrectly here.
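As a small illustration of the calling convention described above (the variable names here are made up for the example and are not from the question):

```python
words = ['had', 'a', 'little', 'lamb.']

# The separator string is the object; the collection is the argument.
sentence = ' '.join(words)
print(sentence)  # had a little lamb.

# Joining a dictionary iterates over its keys only, not its values.
pairs = {'Mary': ['had', 'a']}
keys_only = ' '.join(pairs)
print(keys_only)  # Mary
```

So calling join directly on the dictionary would only glue the keys together; the values have to be collected into a list first.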
This will add terms until dic does not contain key.
dic = {'had': ['a', 'little'], 'Mary': ['had', 'a'], 'a': ['little', 'lamb.']}
key = 'Mary'
res = []
while True:
    try:
        res.extend(dic[key])
        key = dic[key][-1]
    except KeyError:
        break
print(' '.join(res))
This yields:
had a little lamb.
Be aware: you will enter an infinite loop if every value is also a key. The same happens if there is a repeating sequence in your dictionary, such as
{'a': ['b', 'c'], 'b': ['a', 'c'], 'c': ['a', 'b'], 'foo': ['bar', 'foobar']}
To avoid this, you could do one of two things:
Set a maximum number of iterations.
Stop the iteration when you encounter a key that has been seen before.
Maximum iteration value:
dic = {'had': ['a', 'little'], 'Mary': ['had', 'a'], 'a': ['little', 'lamb.']}
key = 'Mary'
res = []
max_iterations = 10
i = 0
# The loop condition already enforces the limit, so no extra check is needed
while i < max_iterations:
    try:
        res.extend(dic[key])
        key = dic[key][-1]
    except KeyError:
        break
    i += 1
print(' '.join(res))
Stop at previously seen key:
dic = {'had': ['a', 'little'], 'Mary': ['had', 'a'], 'a': ['little', 'lamb.']}
key = 'Mary'
res = []
seen_keys = set()
while True:
    # Stop as soon as we come back to a key we have already followed
    if key in seen_keys:
        break
    try:
        res.extend(dic[key])
        seen_keys.add(key)
        key = dic[key][-1]
    except KeyError:
        break
print(' '.join(res))
Your data structure dict['word1'] = ('word2', 'word3') requires you to search into the data values, which is very inefficient.
It would be much easier to look up if it was organized as dict[('word1', 'word2')] = ['possible', 'word3', 'values'].
from itertools import tee
from collections import defaultdict
from random import choice

def triwise(iterable):
    # Yield overlapping triples: (s0, s1, s2), (s1, s2, s3), ...
    a, b, c = tee(iter(iterable), 3)
    next(b, None)
    next(c, None)
    next(c, None)
    return zip(a, b, c)  # Python 3: zip is lazy, so izip is not needed

def make_lookup(txt):
    res = defaultdict(list)
    # Empty strings mark the start and the end of the text
    words = ['', ''] + txt.strip().split() + ['', '']
    for w1, w2, w3 in triwise(words):
        res[(w1, w2)].append(w3)
    return dict(res)

def make_sentence(lookup):
    w1, w2 = '', ''
    words = []
    while True:
        # Pick a random word that followed the current pair in the text
        w1, w2 = w2, choice(lookup[(w1, w2)])
        if w2 == '':
            return ' '.join(words)
        else:
            words.append(w2)

def main():
    txt = 'Mary had a little lamb whose fleece was white as snow'
    lookup = make_lookup(txt)
    print(make_sentence(lookup))

if __name__ == "__main__":
    main()
Related
I have 2 strings:
s1 = "Are they here"
s2 = "yes, they are here"
I want to create a dictionary (e) whose keys are the maximum number of times each shared character appears in whichever string contains it the most, and whose values are the characters themselves. For example, "y" is contained once in s1 and twice in s2, so I want a dict that goes:
e = {2: 'y'} # and so on
To describe my code, I thought of creating a list (c) with all the shared elements:
c = ['r', 'e', 't', 'h', 'e', 'y', 'h', 'e', 'r', 'e', 'y', 'e', 't', 'h', 'e', 'y', 'r', 'e', 'h', 'e', 'r', 'e']
then switch it to a set to eliminate duplicates and using them as iterators:
d = {'h', 'y', 'r', 't', 'e'}
Ultimately I thought of using a for loop to fill the dict (e) by iterating every element in d and reporting the maximum times it was present.
Here's my full code
please note that I don't want to use any library.
Also note that the code works with dict comprehension:
def mix(s1, s2):
    c = [] # create a var to be filled with all shared chars
    for i in s1:
        if i != " ":
            if i in s2:
                c.append(i)
    for i in s2:
        if i != " ":
            if i in s1:
                c.append(i) # end of 1st process
    d = set(c) # remove duplicates
    e = {} # create a dict to align counting and relative char
    for i in d:
        a = s1.count(i)
        b = s2.count(i)
        m = max(a, b)
        e[m] = i
    # z = {i:max(s1.count(i), s2.count(i)) for i in d} this is what actually works
    return e # z works instead
The issue I get is that the for loop seems to stop after 3 iterations.
Edit: I see that Rakshith B S has made a better version of my comment, refer to theirs.
I'll start by saying I'm an amateur, and the following can absolutely be simplified.
First, decide about capitalization, A != a, use str.lower or str.upper.
Second, switching the dictionary to be {'letter':count} would make everything easier.
Then, it would most likely be easier to create two dictionaries to count the unique letters in each string.
d1 = {}
s1 = s1.lower()
for letter in s1:
    if letter != " ":
        if letter in d1:
            d1[letter] += 1 # if in dict, add one to count
        else:
            d1[letter] = 1 # add new letter to dict
d2 = {}
s2 = s2.lower()
for letter in s2:
    if letter != " ":
        if letter in d2:
            d2[letter] += 1 # if in dict, add one to count
        else:
            d2[letter] = 1 # add new letter to dict
That should give you two dictionaries. Loop over them to compare the counts and keep the maximum for each letter (this part can be made more efficient).
d3 = {}
for let in d1:
    if let not in d2:
        d3[let] = d1.get(let)
    if let in d2:
        if d1[let] >= d2[let]:
            d3[let] = d1.get(let)
        else:
            d3[let] = d2.get(let)
for let in d2:
    if let not in d1:
        d3[let] = d2.get(let)
d3.pop(',', None) # drop the comma; pop with a default avoids a KeyError if there is none
This should at least get you on the right track.
I have just realized that dictionary keys must be unique, so of course my code only displays part of the result.
When the same key comes up again, the earlier entry is overwritten.
So using the element as the key will work, and the for loop can look like this:
for i in d:
    a = s1.count(i)
    b = s2.count(i)
    m = max(a, b)
    e[i] = m
def mix(s1, s2):
    dict1 = dict()
    dict2 = dict()
    for i in s1:
        if i != " " and i != ",":
            if i in dict1:
                dict1[i] += 1
            else:
                dict1[i] = 1
    for i in s2:
        if i != " " and i != ",":
            if i in dict2:
                dict2[i] += 1
            else:
                dict2[i] = 1
    # print(dict1)
    # print(dict2)
    for key, value in dict2.items():
        if key in dict1:
            # print(f' check {key}, {value}')
            if value >= dict1[key]:
                dict1[key] = value
        else:
            dict1[key] = value
            # print(f' create {key}, {value}')
    return {v: k for k, v in dict1.items()} # inverted

s1 = "eeeeaaabbbcccc"
s2 = "eeeeeaaa"
print(mix(s1, s2))
Why create a merged list and then recheck it against the set?
Here I compare the counts from dict1 (for s1) and dict2 (for s2). If a letter is already in dict1, its count is overwritten when the count in dict2 is at least as high; if the letter is not in dict1 at all, it is added with the count from dict2.
OUTPUTS:
{'e': 5, 'a': 3, 'b': 3, 'c': 4}
{5: 'e', 3: 'b', 4: 'c'}
Inverting can overwrite entries: here 'a' is overwritten by 'b', because both have a count of 3.
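One way around the overwriting, sketched without any imports: map each count to a list of letters instead of a single letter. The helper name group_by_count is hypothetical and not part of the answer above.

```python
def group_by_count(counts):
    """Invert {letter: count} into {count: [letters]} without losing ties."""
    grouped = {}
    for letter, count in counts.items():
        # setdefault creates the list the first time a count is seen
        grouped.setdefault(count, []).append(letter)
    return grouped

counts = {'e': 5, 'a': 3, 'b': 3, 'c': 4}
grouped = group_by_count(counts)
print(grouped)  # {5: ['e'], 3: ['a', 'b'], 4: ['c']}
```

With this shape, 'a' and 'b' both survive under the key 3.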
I need to display a letter and its count if it has the maximum count in a name. However, two letters (n: 2, u: 2) have the same count in my name. How do I print both letters with their counts when they tie for the maximum? I could only do it for one letter.
name = 'Annuu'
name = name.lower()
names = set(name)
highest = 0
p = ''
for i in names:
    if name.count(i) > highest:
        highest = name.count(i)
        p = i
print(f"{p} {highest}")
You can use Counter object to find the count.
Then find the maximum count to filter the letters.
from collections import Counter
name = "annuu"
count_dict = Counter(name)
max_count = max(count_dict.values())
for letter, count in count_dict.items():
    if count == max_count:
        print(letter, count)
This is without using any imports:
name = "Onnuu"
name = name.lower()
names = set(name)
print(names)
l = []
for i in names:
    l.append((name.count(i), i))
l.sort(reverse=True)
for i in l:
    # l[0] holds the highest count after the descending sort
    if l[0][0] == i[0]:
        print(i[1])
Store the values in dict and find the max_frequency
name = 'Annuu'
name = name.lower()
d={}
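for i in name:
    d[i] = d.get(i, 0) + 1
max_freq = max(d.values())
# Sort by (count, letter) in descending order so the most frequent letters come first.
# Python 3 does not allow tuple parameters in a lambda, so index into the pair instead.
for k, v in sorted(d.items(), key=lambda kv: (kv[1], kv[0]), reverse=True):
    if v == max_freq:
        print(k, v)
    else:
        break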
The following code works and produces the output:
The maximum characters and their respective count is as follows:
n 2
u 2
name = 'Annuu'
name = name.lower()
names = set(name)
name_count_dict = {} # Use dictionary because of easy mapping between character and respective max
for current_char in names:
    # First save the counts in a dictionary
    name_count_dict[current_char] = name.count(current_char)
# Use the max function to find the max (only one max at this point but will find the remaining in the lines below)
max_char = max(name_count_dict, key=name_count_dict.get)
max_value = name_count_dict[max_char]
# Find all other characters which match the max-value, i.e. all maximums
array_of_all_maxes = [k for k, v in name_count_dict.items() if v == max_value]
print("The maximum characters and their respective count is as follows:")
for max_chars in array_of_all_maxes:
    print(f"{max_chars} {max_value}")
I think this is a simple solution to the problem that does not import any module such as collections.
Here I have written two test cases and repeated the same lines of code. You don't have to do it that way: if you have more than two or three cases, it is better to write a separate function that tests the code by passing different values to it.
def get_count_and_highest(name):
    name = name.lower()
    names = set(name)
    highest = 0
    d = {}
    for ch in names:
        count = name.count(ch)
        if count >= highest:
            highest = count
            # Only record characters that reach the running maximum;
            # otherwise lower counts would be filed under the wrong key
            if highest in d:
                d[highest].append(ch)
            else:
                d[highest] = [ch]
    return highest, d

# Test case 1
highest, d = get_count_and_highest("Annuu")
l = d.get(highest, []) # if dictionary d is empty, l will be an empty list
output = {ch: highest for ch in l}
print(highest) # 2
print(d) # e.g. {1: ['a'], 2: ['n', 'u']}; entries below the maximum depend on set iteration order
print(l) # ['n', 'u'] (order may vary)
print(output) # {'n': 2, 'u': 2} (order may vary)
# Test case 2
highest, d = get_count_and_highest("Babylon python new one")
l = d.get(highest, []) # if dictionary d is empty, l will be an empty list
output = {ch: highest for ch in l}
print(highest) # 4
print(d) # entries below the maximum depend on set iteration order
print(l) # ['n']
print(output) # {'n': 4}
An example with collections.Counter
from collections import Counter
name = 'Annuu'
c = Counter(name.lower())
mc = c.most_common()
max_count = mc[0][1]
# Keep every (letter, count) pair that ties for the maximum
print([x for x in mc if x[1] == max_count]) # [('n', 2), ('u', 2)]
I'm writing a function 'simplify' to simplify polynomials so that simplify("2xy-yx") can return "xy", simplify("-a+5ab+3a-c-2a")can return "-c+5ab" and so on.
I am at the stage where I have broken the polynomials into multiple monomials as elements for a list and have separated the coefficient of the monomials and the letter (variable) parts.
For instance
input = '3xy+y-2x+2xy'
My process gives me:
Var = ['xy', 'y', 'x', 'xy']
Coe = ['+3', '+1', '-2', '+2']
What I want to do is to merge the same monomials and add up their corresponding coefficients in the other list simultaneously.
My code was:
Play1 = Letter[:]
Play2 = Coe[:]
for i in range(len(Play1) - 1):
    for j in range(i+1, len(Play1)):
        if Play1[i] == Play1[j]:
            Letter.pop(j)
            Coe[i] = str(int(Play2[i]) + int(Play2[j]))
            Coe.pop(j)
But this seems to only work with lists where each duplicate element appears no more than twice. For instance, input of "-a+5ab+3a-c-2a" gives me:
IndexError: pop index out of range
I thought of using set, but that will change the order.
What's the best way to proceed? Thanks.
Combine your lists with zip() for easier processing, and create a new list:
newVar = []
newCoe = []
for va, co in zip(Var, Coe):
    # try/except (EAFP) is very Pythonic
    try:
        # See if this var is seen
        ind = newVar.index(va)
        # Yeah, seen, let's add the coefficient
        newCoe[ind] = str(int(newCoe[ind]) + int(co))
    except ValueError:
        # No it's not seen, add both to the new lists
        newVar.append(va)
        newCoe.append(co)
Because all items are processed in their original order, as well as using list appending instead of hash tables (like set and dict), the order is preserved.
This is a typical use case where a dict comes in handy:
from collections import defaultdict
Var = ['xy', 'y', 'x', 'xy']
Coe = ['+3', '+1', '-2', '+2']
polynom = defaultdict(int)
for var, coeff in zip(Var, Coe):
    # Missing variables start at 0, so coefficients can simply be summed
    polynom[var] += int(coeff)
Var, Coe = list(polynom.keys()), list(polynom.values())
Your input was:
input = '3xy+y-2x+2xy'
You reached till:
Var = ['xy', 'y', 'x', 'xy']
Coe = ['+3', '+1', '-2', '+2']
Use the code below. With the coefficients in the example call (where y has -1), it returns +5xy-y-2x:
def varCo(Var, Coe):
    # Sum the integer coefficients per variable first, then format
    totals = {}
    for var, coe in zip(Var, Coe):
        totals[var] = totals.get(var, 0) + int(coe)
    terms = []
    for var, total in totals.items():
        if total == 0:
            continue # the term cancelled out
        if total == 1:
            terms.append("+" + var)
        elif total == -1:
            terms.append("-" + var)
        else:
            terms.append(f"{total:+d}{var}")
    return "".join(terms)

Var = ['xy', 'y', 'x', 'xy']
Coe = ['+3', '-1', '-2', '+2']
print(varCo(Var, Coe))
# Result --> +5xy-y-2x
Try this, using regex:
import re
# a = '3xy+y-2x+2xy'
a = "-a+5ab+3a-c-2a"
i = re.findall(r"[\w]+", a)
j = re.findall(r"[\W]+", a)
if len(i)!=len(j):
j.insert(0,'+')
d = []
e = []
for k in i:
    # Split each term into its numeric coefficient and its letters
    match = re.match(r"([0-9]+)([a-z]+)", k, re.I)
    if match:
        items = match.groups()
        d.append(items[0])
        e.append(items[1])
    else:
        # No digits: the coefficient is implicitly 1
        d.append('1')
        e.append(k)
print(e)
f = []
for ii, jj in zip(j, d):
    f.append(ii+jj)
print(f)
Input:
a = "-a+5ab+3a-c-2a"
Output:
['a', 'ab', 'a', 'c', 'a']
['-1', '+5', '+3', '-1', '-2']
Input:
a = '3xy+y-2x+2xy'
Output:
['xy', 'y', 'x', 'xy']
['+3', '+1', '-2', '+2']
I want to write code that counts all triplets in a sequence. I've read plenty of posts so far, but none of them could help me.
This is my code:
def cnt(seq):
    mydict = {}
    if len(seq) % 3 == 0:
        a = [x for x in seq]
        for i in range(len(seq)//3):
            b = ''.join(a[(0+3*i):(3+3*i)])
            for base1 in ['A', 'T', 'G', 'C']:
                for base2 in ['A', 'T', 'G', 'C']:
                    for base3 in ['A', 'T', 'G', 'C']:
                        triplet = base1 + base2 + base3
                        if b == triplet:
                            mydict[b] = 1
        for key in sorted(mydict):
            print("%s: %s" % (key, mydict[key]))
    else:
        print("Error")
Does Biopython provide a function to solve this problem?
EDIT:
Note that, for instance, in the sequence 'ATGAAG', 'TGA' or 'GAA' are not "valid" triplets, only 'ATG' and 'AAG', because in biology and bioinformatics we read it as 'ATG' and 'AAG'; that's the information we need to translate it, or whatever else.
You can imagine it as a sequence of words, for example "Hello world". The way we read it is "Hello" and "world", not "Hello", "ello ", "llo w",...
It took me a while to understand that you do not want to count the number of codons but the frequency of each codon. Your title is a bit misleading in this respect. Anyway, you can employ collections.Counter for your task:
from collections import Counter
def cnt(seq):
    if len(seq) % 3 == 0:
        #split list into codons of three
        codons = [seq[i:i+3] for i in range(0, len(seq), 3)]
        #create Counter dictionary for it
        codon_freq = Counter(codons)
        #determine number of codons, should be len(seq) // 3
        n = sum(codon_freq.values())
        #print out all entries in an appealing form
        for key in sorted(codon_freq):
            print("{}: {} = {:5.2f}%".format(key, codon_freq[key], codon_freq[key] * 100 / n))
        #or just the dictionary
        #print(codon_freq)
    else:
        print("Error")

seq = "ATCGCAGAAATCCGCAGAATC"
cnt(seq)
Sample output:
AGA: 1 = 14.29%
ATC: 3 = 42.86%
CGC: 1 = 14.29%
GAA: 1 = 14.29%
GCA: 1 = 14.29%
You can use clever techniques, as suggested in the other answers, but I will build a solution starting from your code, which is almost working: Your problem is that every time you do mydict[b] = 1, you reset the count of b to 1.
A minimal fix
You could solve this by testing if the key is present, if not, create the entry in the dict, then increment the value, but there are more convenient tools in python.
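For reference, the explicit "test, create, then increment" approach could look roughly like this. It is only a sketch: count_triplets is a hypothetical helper, not the code from the question, and it silently drops an incomplete trailing triplet.

```python
def count_triplets(seq):
    """Count non-overlapping triplets with a plain dict (hypothetical helper)."""
    counts = {}
    # Step through the sequence three letters at a time
    for start in range(0, len(seq) - 2, 3):
        triplet = seq[start:start + 3]
        if triplet not in counts:
            counts[triplet] = 0  # create the entry the first time it is seen
        counts[triplet] += 1     # then increment instead of resetting to 1
    return counts

result = count_triplets("ACTGGCACT")
print(result)  # {'ACT': 2, 'GGC': 1}
```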
A minimal change to your code would be to use a defaultdict(int) instead of a dict. Whenever a new key is encountered, it is assumed to have the associated default value for an int: 0. So you can increment the value instead of resetting:
from collections import defaultdict
def cnt(seq):
    # instantiate a defaultdict that creates ints when necessary
    mydict = defaultdict(int)
    if len(seq) % 3 == 0:
        a = [x for x in seq]
        for i in range(len(seq)//3):
            b = ''.join(a[(0+3*i):(3+3*i)])
            for base1 in ['A', 'T', 'G', 'C']:
                for base2 in ['A', 'T', 'G', 'C']:
                    for base3 in ['A', 'T', 'G', 'C']:
                        triplet = base1 + base2 + base3
                        if b == triplet:
                            # increment the existing count (or the default 0 value)
                            mydict[b] += 1
        for key in sorted(mydict):
            print("%s: %s" % (key, mydict[key]))
    else:
        print("Error")
It works as desired:
cnt("ACTGGCACT")
ACT: 2
GGC: 1
Some possible improvements
Now let's try to improve your code a bit.
First, as I wrote in the comments, let's avoid the unnecessary conversion of your sequence to a list (a string can be sliced directly), and use a better variable name for the currently counted codon:
from collections import defaultdict

def cnt(seq):
    mydict = defaultdict(int)
    if len(seq) % 3 == 0:
        for i in range(len(seq)//3):
            codon = seq[(0+3*i):(3+3*i)]
            for base1 in ['A', 'T', 'G', 'C']:
                for base2 in ['A', 'T', 'G', 'C']:
                    for base3 in ['A', 'T', 'G', 'C']:
                        triplet = base1 + base2 + base3
                        if codon == triplet:
                            mydict[codon] += 1
        for key in sorted(mydict):
            print("%s: %s" % (key, mydict[key]))
    else:
        print("Error")
Now let's simplify the nested loop part, which tries all possible codons, by generating the set of possible codons in advance:
from collections import defaultdict
from itertools import product
codons = {
"".join((base1, base2, base3))
for (base1, base2, base3) in product("ACGT", "ACGT", "ACGT")}
def cnt(seq):
    mydict = defaultdict(int)
    if len(seq) % 3 == 0:
        for i in range(len(seq)//3):
            codon = seq[(0+3*i):(3+3*i)]
            # A set membership test replaces the three nested loops
            if codon in codons:
                mydict[codon] += 1
        for key in sorted(mydict):
            print("%s: %s" % (key, mydict[key]))
    else:
        print("Error")
Now, your code simply ignores the triplets that are not valid codons. Maybe you should instead issue a warning:
from collections import defaultdict
from itertools import product
codons = {
"".join((base1, base2, base3))
for (base1, base2, base3) in product("ACGT", "ACGT", "ACGT")}
def cnt(seq):
    mydict = defaultdict(int)
    if len(seq) % 3 == 0:
        for i in range(len(seq)//3):
            codon = seq[(0+3*i):(3+3*i)]
            # We count even invalid triplets
            mydict[codon] += 1
        # We display counts only for valid triplets
        for codon in sorted(codons):
            print("%s: %s" % (codon, mydict[codon]))
        # We compute the set of invalid triplets:
        # the keys that are not codons.
        invalid = mydict.keys() - codons
        # An empty set has value False in a test.
        # We issue a warning if the set is not empty.
        if invalid:
            print("Warning! There are invalid triplets:")
            print(", ".join(sorted(invalid)))
    else:
        print("Error")
A more fancy solution
Now a more fancy solution, using cytoolz (probably needs to be installed because it is not part of usual python distributions: pip3 install cytoolz, if you are using pip):
from collections import Counter
from itertools import product, repeat
from cytoolz import groupby, keymap, partition
# To make strings out of lists of strings
CAT = "".join
# The star "extracts" the elements from the result of repeat,
# so that product has 3 arguments, and not a single one
codons = {CAT(bases) for bases in product(*repeat("ACGT", 3))}
def cnt(seq):
    # keymap(CAT, ...) transforms the keys (that are tuples of letters)
    # into strings
    # if len(seq) is not a multiple of 3, pad="-" will append "-"
    # to complete the last triplet (which will be an invalid one)
    codon_counts = keymap(CAT, Counter(partition(3, seq, pad="-")))
    # separate encountered codons into valids and invalids
    codons_by_validity = groupby(codons.__contains__, codon_counts.keys())
    # get allows to provide a default value,
    # in case one of the categories is not present
    valids = codons_by_validity.get(True, [])
    invalids = codons_by_validity.get(False, [])
    # We display counts only for valid triplets
    for codon in sorted(valids):
        print("%s: %s" % (codon, codon_counts[codon]))
    # We issue a warning if there are invalid codons.
    if invalids:
        print("Warning! There are invalid triplets:")
        print(", ".join(sorted(invalids)))
Hope this helps.
You could do something like this:
from itertools import product
seq = 'ATGATG'
all_triplets = [seq[i:i+3] for i in range(len(seq)) if i <= len(seq)-3]
# this gives ['ATG', 'TGA', 'GAT', 'ATG']
# add more valid_triplets here
valid_triplets = ['ATG']
len([(i, j) for i, j in product(valid_triplets, all_triplets) if i==j])
Output:
2
It is unclear what output is expected. Here we use one of many grouping functions from more_itertools to build adjacent triplets or "codons".
import more_itertools as mit
seq = "ATGATG"
# Recent more_itertools releases take the iterable first: grouper(iterable, n)
codons = ["".join(w) for w in mit.grouper(seq, 3)]
codons
# ['ATG', 'ATG']
Count the number of codons by calling len.
len(codons)
# 2
For more detailed analysis, consider splitting the problem into smaller functions that (1) extract codons and (2) compute occurrences.
Code
import collections as ct
import more_itertools as mit

def split_codons(seq):
    "Return codons from a sequence; raise for bad sequences."
    for w in mit.windowed(seq, n=3, step=3, fillvalue=""):
        part = "".join(w)
        if len(part) < 3:
            raise ValueError(f"Sequence not divisible by 3. Got extra '{part}'.")
        yield part

def count_codons(codons):
    """Return dictionary of codon occurrences as (count, percentage)."""
    dd = ct.defaultdict(int)
    for c in codons:
        dd[c] += 1
    # Summing afterwards also works when no codons were given
    total = sum(dd.values())
    return {k: (v, 100 * v / total) for k, v in dd.items()}
Demo
>>> seq = "ATCGCAGAAATCCGCAGAATC"
>>> bad_seq = "ATCGCAGAAATCCGCAGAATCA"
>>> list(split_codons(seq))
['ATC', 'GCA', 'GAA', 'ATC', 'CGC', 'AGA', 'ATC']
>>> list(split_codons(bad_seq))
ValueError: Sequence not divisible by 3. Got extra 'A'.
>>> count_codons(split_codons(seq))
{'ATC': (3, 42.857142857142854),
'GCA': (1, 14.285714285714286),
'GAA': (1, 14.285714285714286),
'CGC': (1, 14.285714285714286),
'AGA': (1, 14.285714285714286)}
What I would like to do is something like this:
testdictionary = {"a":1, "b":2, "c":3, "A":4}
list1 = []
list2 = []
keyval = 200
for char in string:
i = 0
y = "".join(list1)
while y in testdictionary:
list1.append(string[i])
i +=1
list2.append(y[:-1])
testdictionary[y] = keyval
keyval +=1
string = string[((len(list1))-1):]
list1 = []
So for a string "abcacababa" the desired output would be:
['ab', 'ca', 'cab', 'aba']
Or "AAAAA" would be
['A', 'AA', 'AA']
Take abcacababa. Iterating through we get a which is in testdictionary so we append list1 again. This time we have ab which is not in the dictionary, so we add it as a key to testdictionary with a value of 200. Then doing the same process again, we add ca to testdictionary with a value of 201. Then since we have already added ca, the next value appended to list2 would be cab and so on.
What I am trying to do is take a string and compare each character against a dictionary, if the character is a key in the dictionary add another character, do this until it is not in the dictionary at which point add it to the dictionary and assign a value to it, keep doing this for the whole string.
There's obviously a lot wrong with this code, it also doesn't work. The i index being out of range but I have no idea how to approach this iteration. Also I need to add in an if statement to ensure the "leftovers" of the string at the end are appended to list2. Any help is appreciated.
I think I get it now #Boa. This code I believe works for abcacababa at least. As for leftovers, I think it's only possible to have a single 'leftover' key when the last key is in the test dictionary, so you just have to check after the loop if curr_key is not empty:
testdictionary = {"a":1, "b":2, "c":3, "A":4}
word = 'abcacababa'
key_val = 200
curr_key = ''
out_lst = []
for let in word:
    curr_key += let
    if curr_key not in testdictionary:
        # First time this chunk is seen: record it and start a new one
        out_lst.append(curr_key)
        testdictionary[curr_key] = key_val
        key_val += 1
        curr_key = ''
# Whatever is left here is already a key in the dictionary (or empty)
leftover = curr_key
print(out_lst)
print(testdictionary)
Output:
['ab', 'ca', 'cab', 'aba']
{'a': 1, 'A': 4, 'c': 3, 'b': 2, 'aba': 203, 'ca': 201, 'ab': 200, 'cab': 202}
Please let me know if anything is unclear. Also I think your second example with AAAAA should be ['AA', 'AAA'] instead of ['A', 'AA', 'AA']
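If it helps, the loop above can be wrapped into a function that also appends the leftover. This is only a sketch: the name parse, its signature, and the first_code default are assumptions made for illustration, and the input dictionary is copied rather than modified.

```python
def parse(word, dictionary, first_code=200):
    """Split word into the shortest chunks not yet in dictionary.

    Hypothetical helper: name, signature and first_code are assumptions.
    """
    known = dict(dictionary)  # work on a copy so the caller's dict is untouched
    chunks = []
    current = ''
    code = first_code
    for letter in word:
        current += letter
        if current not in known:
            chunks.append(current)
            known[current] = code
            code += 1
            current = ''
    if current:
        # Leftover: the tail is already a known key, so keep it as-is.
        chunks.append(current)
    return chunks

base = {"a": 1, "b": 2, "c": 3, "A": 4}
print(parse("abcacababa", base))  # ['ab', 'ca', 'cab', 'aba']
print(parse("AAAAA", base))       # ['AA', 'AAA']
print(parse("aba", base))         # ['ab', 'a']
```

The last call shows the leftover case: after 'ab' is emitted, the remaining 'a' is already a key, so it is appended unchanged at the end.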