Finding consecutive consonants in a word - python

I need code that will show me the consecutive consonants in a word. For example, for "concertation" I need to obtain ["c","nc","rt","t","n"].
Here is my code:
def SuiteConsonnes(mot):
    consonnes=[]
    for x in mot:
        if x in "bcdfghjklmnprstvyz":
            consonnes += x + ''
    return consonnes
I manage to find the consonants, but I don't see how to find them consecutively. Can anybody tell me what I need to do?

You can use regular expressions, implemented in the re module.
Better solution
>>> import re
>>> re.findall(r'[bcdfghjklmnpqrstvwxyz]+', "concertation", re.IGNORECASE)
['c', 'nc', 'rt', 't', 'n']
[bcdfghjklmnpqrstvwxyz]+ matches any sequence of one or more characters from the character class.
re.IGNORECASE enables a case-insensitive match on the characters. That is:
>>> re.findall(r'[bcdfghjklmnpqrstvwxyz]+', "CONCERTATION", re.IGNORECASE)
['C', 'NC', 'RT', 'T', 'N']
Another solution
>>> import re
>>> re.findall(r'[^aeiou]+', "concertation")
['c', 'nc', 'rt', 't', 'n']
[^aeiou] is a negated character class. It matches any character other than the ones listed in the class; in short, it matches the consonants in the string.
+ is a quantifier that matches one or more occurrences of the preceding pattern.
Note: this will also match adjacent non-alphabetic characters, because the character class accepts anything other than the vowels.
Example
>>> re.findall(r'[^aeiou]+', "123concertation")
['123c', 'nc', 'rt', 't', 'n']
If you are sure that the input always contains only letters, this solution is fine.
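If the input may contain digits or punctuation, one possible workaround is to exclude non-letters inside the same negated class. This is only a sketch: the helper name consonant_runs is hypothetical, and lowercasing the input first is an assumption about the desired output.

```python
import re

def consonant_runs(word):
    # [^aeiou\W\d_] = any character that is not a vowel, not a non-word
    # character, not a digit and not an underscore, i.e. an ASCII consonant.
    # re.ASCII keeps \W and \d limited to ASCII characters.
    return re.findall(r"[^aeiou\W\d_]+", word.lower(), flags=re.ASCII)

print(consonant_runs("123concertation!"))
```

Digits and punctuation now act as separators instead of being glued onto the neighbouring consonants.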
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings.
The string is scanned left-to-right, and matches are returned in the order found.
If you are curious about how the result is obtained for
re.findall(r'[bcdfghjklmnpqrstvwxyz]+', "concertation")
concertation
|
c
concertation
 |
# o is not present in the character class. Matching ends here. Adds match 'c' to the output list
concertation
  |
n
concertation
   |
c
concertation
    |
# Match ends again. Adds match 'nc' to the list
# And so on
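The same left-to-right scan can be observed with re.finditer, which yields a match object for each run instead of just the string. This is a small illustrative sketch; the variable names pattern and spans are made up for the example.

```python
import re

pattern = re.compile(r"[bcdfghjklmnpqrstvwxyz]+")

# finditer yields one match object per run, scanning left to right;
# start() and end() give the slice of the input covered by each match.
spans = [(m.start(), m.end(), m.group()) for m in pattern.finditer("concertation")]
for start, end, text in spans:
    print(start, end, text)
```

Each printed line shows where a run begins, where it ends (exclusive), and the consonants it contains.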

You could do this with regular expressions and the re module's split function:
>>> import re
>>> re.split(r"[aeiou]+", "concertation", flags=re.I)
['c', 'nc', 'rt', 't', 'n']
This method splits the string whenever one or more consecutive vowels are matched.
To explain the regular expression "[aeiou]+": here the vowels have been collected into a class [aeiou], while the + indicates that one or more occurrences of any character in this class can be matched. Hence the string "concertation" is split at o, e, a and io.
The re.I flag means that the case of the letters will be ignored, effectively making the character class equal to [aAeEiIoOuU].
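One edge case worth knowing: when the word starts or ends with a vowel, re.split leaves an empty string at that edge. A minimal sketch of filtering those out follows; the helper name split_on_vowels is hypothetical.

```python
import re

def split_on_vowels(word):
    # re.split returns '' at the edges when the word starts or ends with a vowel,
    # so drop the empty pieces afterwards.
    parts = re.split(r"[aeiou]+", word, flags=re.I)
    return [p for p in parts if p]

print(re.split(r"[aeiou]+", "avion", flags=re.I))  # includes a leading ''
print(split_on_vowels("avion"))
```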
Edit: One thing to keep in mind is that this method implicitly assumes that the word contains only vowels and consonants. Numbers and punctuation will be treated as non-vowels/consonants. To match only consecutive consonants, instead use re.findall with the consonants listed in the character class (as noted in other answers).
One useful shortcut to typing out all the consonants is to use the third-party regex module instead of re.
This module supports set operations (under its version 1 behaviour, enabled with the regex.V1 flag or an inline (?V1)), so the character class containing the consonants can be neatly written as the entire alphabet minus the vowels:
[[a-z]--[aeiou]] # equal to [bcdfghjklmnpqrstvwxyz]
Where [a-z] is the entire alphabet, -- is set difference and [aeiou] are the vowels.

If you are up for a non-regex solution, itertools.groupby would work perfectly fine here, like this:
>>> from itertools import groupby
>>> is_vowel = lambda char: char in "aAeEiIoOuU"
>>> def suiteConsonnes(in_str):
...     return ["".join(g) for v, g in groupby(in_str, key=is_vowel) if not v]
...
>>> suiteConsonnes("concertation")
['c', 'nc', 'rt', 't', 'n']

A really simple solution without importing anything is to replace the vowels with a single placeholder, then split on that placeholder:
def SuiteConsonnes(mot):
    consonnes = ''.join([l if l not in "aeiou" else "0" for l in mot])
    return [c for c in consonnes.split("0") if c != '']
To keep it really similar to your code, and to turn it into a generator, we get this:
def SuiteConsonnes(mot):
    consonnes=[]
    for x in mot:
        if x in "bcdfghjklmnpqrstvwxyz":
            consonnes.append(x)
        elif consonnes:
            yield ''.join(consonnes)
            consonnes = []
    if consonnes: yield ''.join(consonnes)

def SuiteConsonnes(mot):
    consonnes=[]
    consecutive = '' # initialize consecutive string of consonants
    for x in mot:
        if x in "aeiou": # checks if x is not a consonant
            if consecutive: # checks if consecutive string is not empty
                consonnes.append(consecutive) # append consecutive string to consonnes
                consecutive = '' # reinitialize consecutive for another consecutive string of consonants
        else:
            consecutive += x # add x to consecutive string if x is a consonant or not a vowel
    if consecutive: # checks if consecutive string is not empty
        consonnes.append(consecutive) # append last consecutive string of consonants
    return consonnes
SuiteConsonnes('concertation')
#['c', 'nc', 'rt', 't', 'n']

Not that I'd recommend it for readability, but a one-line solution is:
In [250]: q = "concertation"
In [251]: ''.join(l if l not in 'aeiou' else ' ' for l in q).split()
Out[251]: ['c', 'nc', 'rt', 't', 'n']
That is: replace each vowel with a space, then split on whitespace.

Use regular expressions from the built-in re module:
import re
def find_consonants(string):
    # find all runs of one or more non-vowels:
    return re.findall(r'[^aeiou]+', string)

Although I think you should go with @nu11p01n73R's answer, this will also work:
re.sub('[AaEeIiOoUu]+',' ','concertation').split()

Related

split a string into a minimal number of unique substrings

Given a string consisting of lowercase letters, I need to split it into a minimal number of substrings in such a way that no letter occurs more than once in each substring.
For example, here are some correct splits of the string "abacdec":
('a', 'bac', 'dec'), ('a', 'bacd', 'ec') and ('ab', 'ac', 'dec').
Given 'dddd', function should return 4. The result can be achieved by splitting the string into four substrings ('d', 'd', 'd', 'd').
Given 'cycle', function should return 2. The result can be achieved by splitting the string into two substrings ('cy', 'cle') or ('c', 'ycle').
Given 'abba', function should return 2 (I believe it should be 1 - the mistake as originally stated). The result can be achieved by splitting the string into two substrings ('ab', 'ba')
Here is the code I've written. I feel that it is too complicated, and I'm not sure whether it is efficient in terms of time complexity.
I would be glad to have suggestions for a shorter and simpler one. Thanks!
#!/usr/bin/python
from collections import Counter
def min_distinct_substrings(string):
    # get the longest and unique substring
    def get_max_unique_substr(s):
        def is_unique(substr):
            return all(m == 1 for m in Counter(substr).values())
        max_sub = 0
        for i in range(len(s), 0, -1):
            for j in range(0, i):
                if len(s[j:i]):
                    if is_unique(s[j:i]):
                        substr_len = len(s[j:i])
                    else:
                        substr_len = 0
                    max_sub = max(max_sub, substr_len)
        return max_sub
    max_unique_sub_len = get_max_unique_substr(string)
    out = []
    str_prefx = []
    # get all valid prefix - 'a', 'ab' are valid - 'aba' not valid since 'a' is not unique
    for j in range(len(string)):
        if all(m==1 for m in Counter(string[:j + 1]).values()):
            str_prefx.append(string[:j + 1])
        else:
            break
    # consider only valid prefix
    for k in str_prefx:
        # get permutation substrings; loop starts from longest substring to substring of 2
        for w in range(max_unique_sub_len, 1, -1):
            word = ''
            words = [k] # first substring - the prefix - is added
            # go over the rest of the string - start from position after prefix
            for i in range(len(k), len(string)):
                # if letter already seen - the substring will be added to words
                if string[i] in word or len(word) >= w:
                    words.append(word)
                    word = ''
                # if not seen and not last letter - letter is added to word
                word += string[i]
            words.append(word)
            if words not in out: # to avoid duplicated words' list
                out.append(words)
    min_list = min(len(i) for i in out) # get the minimum lists
    # filter the minimum lists (and convert to tuple for printing purposes)
    out = tuple( (*i, ) for i in out if len(i) <= min_list )
    return out
The greedy algorithm described by Tarik works well to efficiently get the value of the minimal number of substrings. If you want to find all of the valid splits you have to check them all though:
import itertools
def min_unique_substrings(w):
    def all_substrings_are_unique(ss):
        return all(len(set(s)) == len(s) for s in ss)
    # check if input is already unique
    if all_substrings_are_unique([w]):
        return [[w]]
    # divide the input string into parts, starting with the fewest divisions
    # (up to len(w)-1 parts; len(w) parts is the fallback at the end)
    for divisions in range(2, len(w)):
        splits = []
        for delim in itertools.combinations(range(1, len(w)), divisions-1):
            delim = [0, *delim, len(w)]
            substrings = [w[delim[i]:delim[i+1]] for i in range(len(delim)-1)]
            splits.append(substrings)
        # check if there are any valid unique substring splits
        filtered = list(filter(all_substrings_are_unique, splits))
        if len(filtered):
            # if there are any results they must be divided into the
            # fewest number of substrings and we can stop looking
            return filtered
    # not found; worst case of one character per substring
    return [list(w)]
> print(min_unique_substrings('abacdec'))
[['a', 'bac', 'dec'], ['a', 'bacd', 'ec'], ['a', 'bacde', 'c'], ['ab', 'ac', 'dec'], ['ab', 'acd', 'ec'], ['ab', 'acde', 'c']]
> print(min_unique_substrings('cycle'))
[['c', 'ycle'], ['cy', 'cle']]
> print(min_unique_substrings('dddd'))
[['d', 'd', 'd', 'd']]
> print(min_unique_substrings('abba'))
[['ab', 'ba']]
> print(min_unique_substrings('xyz'))
[['xyz']]
Ok, thought about it again. Using a greedy algorithm should do. Loop through the letters in order. Accumulate a substring as long as no letter is repeated. Once a duplicate letter is found, spit out the substring and start with another substring until all letters are exhausted.
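A minimal sketch of that greedy loop, assuming the input is a plain string; the function name greedy_split is hypothetical. It returns one minimal split, not all of them.

```python
def greedy_split(s):
    parts = []
    current = ""
    for ch in s:
        if ch in current:
            # duplicate letter: emit the accumulated substring and start a new one
            parts.append(current)
            current = ch
        else:
            current += ch
    if current:
        # emit whatever is left once all letters are consumed
        parts.append(current)
    return parts

print(greedy_split("abacdec"))
print(len(greedy_split("dddd")))
```

The minimal number of substrings is then simply len(greedy_split(s)).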

Splitting string into words with regex

How can I split a string with a regex into words no longer than 3 characters, like this:
Input
"ads1323z123123c123123890sdfakslk123klaad,313ks"
Output
['ads', 'z', 'c', 'ks']
You can use re.split:
import re
s = "ads1323z123123c123123890sdfakslk123klaad,313ks"
results = list(filter(lambda x:len(x) <= 3, re.split('[^a-zA-Z]+', s)))
Output:
['ads', 'z', 'c', 'ks']
You can also use lookahead and lookbehind expressions to match only words of 1 to 3 letters:
import re
s = "ads1323z123123c123123890sdfakslk123klaad,313ks"
re.findall('(?<![a-zA-Z])[a-zA-Z]{1,3}(?![a-zA-Z])', s)
Output:
['ads', 'z', 'c', 'ks']
The regular expression works like this: the middle part [a-zA-Z]{1,3} says "match 1 to 3 alphabetic characters". The first part (?<![a-zA-Z]) is a negative lookbehind assertion that asserts that the matched letters are not preceded by an alphabetic character. The last part (?![a-zA-Z]) is a negative lookahead assertion that asserts that the matched letters are not followed by an alphabetic character.
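For readability, the same three parts can be laid out with re.VERBOSE, which lets each piece carry its own comment. This is just a sketch of the pattern above; the name SHORT_WORD is made up for the example.

```python
import re

SHORT_WORD = re.compile(r"""
    (?<![a-zA-Z])   # negative lookbehind: not preceded by a letter
    [a-zA-Z]{1,3}   # one to three letters
    (?![a-zA-Z])    # negative lookahead: not followed by a letter
""", re.VERBOSE)

s = "ads1323z123123c123123890sdfakslk123klaad,313ks"
print(SHORT_WORD.findall(s))
```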

Python: replace an exact matching substring with variable

I have a list of strings like 'cdbbdbda', 'fgfghjkbd', 'cdbbd' etc. I also have a variable fed from another list of strings. What I need is to replace a substring in the first list's strings, say b by z, only if it is preceded by a substring from the variable list, leaving all other occurrences untouched.
What I have:
a = ['cdbbdbda', 'fgfghjkbd', 'cdbbd']
c = ['d', 'f', 'l']
What I do:
for i in a:
    for j in c:
        if j+'b' in i:
            i = re.sub('b', 'z', i)
What I need:
'cdzbdzda'
'fgfghjkbd'
'cdzbd'
What I get:
'cdzzdzda'
'fgfghjkbd'
'cdzzd'
All instances of 'b' are replaced.
I'm new to this, so any help is very welcome. Looking for an answer on Stack Overflow, I found many solutions using regex with word boundaries, or re or str.replace with a count, but I can't use them because the length of the string and the number of occurrences of 'b' can vary.
I think if you include j in the find and replace, you'll get what you want.
>>> for i in a:
...     for j in c:
...         i = re.sub(j+'b', j+'z', i)
...     print(i)
...
cdzbdzda
fgfghjkbd
cdzbd
>>>
I added print(i) because your loop doesn't make in-place changes, so without that output, it's not possible to see what replacements were made.
You should simply use regular expressions with a positive lookbehind assertion.
Like this:
import re
for i in a:
    for j in c:
        i = re.sub('(?<=' + j + ')b', 'z', i)
The base case is:
re.sub('(?<=d)b', 'z', 'cdbbdbda')
You can use a list comprehension:
import re
a = ['cdbbdbda', 'fgfghjkbd', 'cdbbd']
c = ['d', 'f', 'l']
new_a = [re.sub('|'.join('(?<={})b'.format(i) for i in c), 'z', b) for b in a]
Output:
['cdzbdzda', 'fgfghjkbd', 'cdzbd']
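If the prefixes could contain regex metacharacters such as . or *, it may be safer to escape them before building the pattern. The sketch below assumes that possibility; the function name replace_after and its old/new parameters are hypothetical, and it expects a non-empty list of prefixes.

```python
import re

def replace_after(strings, prefixes, old="b", new="z"):
    # One lookbehind per prefix, joined with |, so `old` is replaced only
    # when it directly follows one of the prefixes. re.escape makes each
    # prefix match literally. Assumes `prefixes` is not empty.
    pattern = re.compile(
        "|".join("(?<={}){}".format(re.escape(p), re.escape(old)) for p in prefixes)
    )
    return [pattern.sub(new, s) for s in strings]

a = ['cdbbdbda', 'fgfghjkbd', 'cdbbd']
c = ['d', 'f', 'l']
print(replace_after(a, c))
```

Compiling the pattern once also avoids rebuilding it for every string in the list.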

Isolating letters in a string and converting them to a list

I want to be able to isolate the letters in a string from an input and return them as a collection containing 4 separate lower case characters.
This is what I have so far:
def main():
    original = input(str("Enter a 4-letter word: "))
    letters = isolate_letters(original)
def isolate_letters(original):
    letters = list(original.items())
    return letters
main()
>>> s = "1234"
>>> list(s)
['1', '2', '3', '4']
You want the first 4 lowercase characters:
letters = [c for c in original.lower() if c.isalpha()][:4]
str.isalpha
str.lower
First you convert the string to lowercase (lower()), then pick out all the alphabetic characters from the string (isalpha()) and finally slice off the first 4 ([:4]).
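Put together as a function, this could look like the sketch below. It reuses the isolate_letters name from the question; the count parameter is an assumption added for illustration. Note that a slice of [:4] keeps four items.

```python
def isolate_letters(original, count=4):
    # lowercase the input, keep only alphabetic characters,
    # then keep at most `count` of them
    return [c for c in original.lower() if c.isalpha()][:count]

print(isolate_letters("Ab1cdE"))
```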
I presume you want to filter for the ASCII letters. If not, a Python string is iterable, so for most practical purposes a string behaves just like a list: try using the string as if it were a list.
If you want all letters:
>>> import string
>>> original = "foo, bar, 2014"
>>> letters = [c for c in original if c in string.ascii_letters]
>>> letters
['f', 'o', 'o', 'b', 'a', 'r']
If you want them without repetitions:
>>> unique_letters = set(letters)
>>> unique_letters
{'a', 'b', 'f', 'o', 'r'}
[update]
Thanks a lot! I'm new to Python. Is there any chance you could explain how this works, please?
Well, string.ascii_letters contains:
'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ'
We are using Python list comprehensions, a syntax based on the set-builder notation in math:
[item for item in some_list if condition]
It is almost plain English: return a list of the items in some_list for which condition is true. In our case, we are testing whether the character is an ASCII letter using the in operator.
The set object in Python is an unordered collection that ensures there are no repeated items.
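Because a set is unordered, the letters may come back in any order. If the original order matters, one option (a sketch, assuming Python 3.7 or later, where dicts preserve insertion order) is dict.fromkeys; the name unique_in_order is made up for the example.

```python
import string

original = "foo, bar, 2014"
letters = [c for c in original if c in string.ascii_letters]

# dict keys are unique and keep insertion order (Python 3.7+),
# so this drops repeats while preserving first-seen order
unique_in_order = list(dict.fromkeys(letters))
print(unique_in_order)
```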

If a string ends in a vowel or the following characters?

If a string doesn't end in s, x, ch or sh, or if it doesn't end in a vowel, then I need to add an 's'. This works for tell and eat but doesn't work for show. I can't figure out why.
This is the code I wrote:
if not re.match(".*(s|x|ch|sh)",s):
    if re.match(".*(a|e|i|o|u)",s):
        s = s+'s'
        return s
    else:
        return s
Use endswith instead. It takes a single string, or a tuple of strings, and returns True if the string has any of the given arguments as a suffix.
cons = ('s', 'x', 'ch', 'sh')
vowels = ('a', 'e', 'i', 'o', 'u')
if not s.endswith(cons + vowels):
    s += 's'
return s
You need $ at the end of your regex if you only want to match the end of the string.
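A sketch of the regex version with the anchor in place, assuming the intended rule is the same as in the endswith answer (add 's' unless the word ends in s, x, ch, sh or a vowel); the function name add_s is hypothetical.

```python
import re

def add_s(s):
    # $ anchors the alternatives to the end of the string, so an 's' or 'sh'
    # in the middle of the word (as in "show") no longer counts
    if not re.search(r"(s|x|ch|sh|[aeiou])$", s):
        s += "s"
    return s

print(add_s("show"))
print(add_s("box"))
```

Without the $, the pattern finds the "sh" at the start of "show" and wrongly treats the word as already ending in one of the listed suffixes.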
