Splitting string into words with regex - python

How can I split a string with a regex into words no longer than 3 characters? For example:
Input
"ads1323z123123c123123890sdfakslk123klaad,313ks"
Output
['ads', 'z', 'c', 'ks']

You can use re.split:
import re
s = "ads1323z123123c123123890sdfakslk123klaad,313ks"
results = list(filter(lambda x:len(x) <= 3, re.split('[^a-zA-Z]+', s)))
Output:
['ads', 'z', 'c', 'ks']
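One caveat with this approach: when the string starts or ends with a non-letter, re.split returns an empty string at that end, and an empty string also passes the len(x) <= 3 test. A minimal sketch that filters those out as well (the name split_short_words and the max_len parameter are made up for illustration):

```python
import re

def split_short_words(text, max_len=3):
    # Split on runs of non-letters; leading/trailing separators yield ''
    parts = re.split(r'[^a-zA-Z]+', text)
    # Keep only non-empty pieces of at most max_len letters
    return [p for p in parts if 0 < len(p) <= max_len]

print(split_short_words("123ab,cdef9"))
```

For the input in the question both versions give the same result, since that string starts and ends with a letter.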

You can also use lookahead and lookbehind assertions to match only words of 1 to 3 letters:
import re
s = "ads1323z123123c123123890sdfakslk123klaad,313ks"
re.findall('(?<![a-zA-Z])[a-zA-Z]{1,3}(?![a-zA-Z])', s)
Output:
['ads', 'z', 'c', 'ks']
The regular expression works like this: the middle part [a-zA-Z]{1,3} says "match 1 to 3 alphabetic characters". The first part (?<![a-zA-Z]) is a negative lookbehind assertion that asserts that the 1 to 3 alphabetic characters are not preceded by an alphabetic character. The last part (?![a-zA-Z]) is a negative lookahead assertion that asserts that the 1 to 3 alphabetic characters are not followed by an alphabetic character.
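The three parts described above can also be written out with re.VERBOSE, which allows whitespace and comments inside the pattern. This is only a sketch: the function name short_words and the max_len parameter are assumptions added for illustration, not part of the original answer.

```python
import re

def short_words(text, max_len=3):
    # Doubled braces in the f-string produce a literal {1,max_len} quantifier
    pattern = re.compile(
        rf"""
        (?<![a-zA-Z])            # not preceded by a letter
        [a-zA-Z]{{1,{max_len}}}  # between 1 and max_len letters
        (?![a-zA-Z])             # not followed by a letter
        """,
        re.VERBOSE,
    )
    return pattern.findall(text)

s = "ads1323z123123c123123890sdfakslk123klaad,313ks"
print(short_words(s))  # ['ads', 'z', 'c', 'ks']
```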

Related

Regex to match all repeating alphanumerical subpatterns [duplicate]

This question already has answers here:
How to use regex to find all overlapping matches
(5 answers)
Closed 2 years ago.
After searching for a while, I could only find how to match specific subpattern repetitions. Is there a way I can find (3 or more) repetitions for any subpattern ?
For example:
re.findall(<the_regex>, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_')
→ ['a', 'b', 'x', 'aaabbbxxx_']
re.findall(<the_regex>, 'lalala luuluuluul')
→ ['la', 'luu', 'uul']
I apologize in advance if this is a duplicate and would be grateful to be redirected to the original question.
With this lookahead-based regex you may not get exactly the output shown in the question, but you will get very close:
r'(?=(.+)\1\1)'
Code:
>>> import re
>>> reg = re.compile(r'(?=(.+)\1\1)')
>>> reg.findall('aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['aaabbbxxx_', 'b', 'x', 'a', 'b', 'x', 'a', 'b', 'x']
>>> reg.findall('lalala luuluuluul')
['la', 'luu', 'uul']
RegEx Details:
Because the whole regex is a lookahead, it does not consume any characters: a lookahead is a zero-width match. This is what allows us to return overlapping matches from the input.
Because the pattern contains a capture group, findall returns only the contents of that group.
(?=: Start lookahead
(.+): Match 1 or more of any character (greedy) and capture in group #1
\1\1: Match 2 more occurrences of group #1, using the back-reference \1 twice
): End lookahead
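To see the zero-width behaviour, the same pattern can be run through re.finditer and the start position of each match printed. A small sketch (the helper name repeated_with_positions is made up for illustration):

```python
import re

def repeated_with_positions(text):
    pattern = re.compile(r'(?=(.+)\1\1)')
    # Each match is empty (zero width); only group 1 carries the repeated unit
    return [(m.start(), m.group(1)) for m in pattern.finditer(text)]

print(repeated_with_positions('lalala luuluuluul'))
# [(0, 'la'), (7, 'luu'), (8, 'uul')]
```

The matches at positions 7 and 8 overlap in the input, which would not be possible if the pattern consumed the characters it inspects.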
re.findall() won't find overlapping matches. But you can find the non-overlapping matches using a capture group followed by a positive lookahead that matches a back-reference to that group.
>>> import re
>>> regex = r'(.+)(?=\1{2})'
>>> re.findall(regex, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['aaabbbxxx_', 'a', 'b', 'x', 'a', 'b', 'x']
>>> re.findall(regex, 'lalala luuluuluul')
['la', 'luu']
This will find the longest matches; if you change (.+) to (.+?) you'll get the shortest matches at each point.
>>> regex = r'(.+?)(?=\1{2})'
>>> re.findall(regex, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_')
['a', 'b', 'x', 'a', 'b', 'x', 'a', 'b', 'x']
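One way to get closer to the output asked for in the question is to run the lookahead-only form of the pattern in both greedy and lazy variants and drop duplicates. This is a sketch under a stated limitation: it reports only the shortest and the longest repeating unit at each start position, not every possible one. The name repeated_units is hypothetical.

```python
import re

GREEDY = re.compile(r'(?=(.+)\1\1)')   # longest unit at each position
LAZY = re.compile(r'(?=(.+?)\1\1)')    # shortest unit at each position

def repeated_units(text):
    found = LAZY.findall(text) + GREEDY.findall(text)
    # dict.fromkeys removes duplicates while keeping first-seen order
    return list(dict.fromkeys(found))

print(repeated_units('aaabbbxxx_aaabbbxxx_aaabbbxxx_'))
# ['a', 'b', 'x', 'aaabbbxxx_']
print(repeated_units('lalala luuluuluul'))
# ['la', 'luu', 'uul']
```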
It is not possible without defining the subpattern first.
Anyway, if the subpattern is just <any_alphanumeric>, then re.findall(<the_regex>, 'aaabbbxxx_aaabbbxxx_aaabbbxxx_') would produce something like this:
['a', 'b', 'x', 'aa', 'ab', 'bb', 'bx', 'xx', 'x_', 'aaa', 'aaab', 'aaabb', ....]
i.e., every alphanumeric combination that is repeated thrice, which is a lot of combinations and not just ['a', 'b', 'x', 'aaabbbxxx_'].

Pattern in regular expressions that match pattern_1 and not match pattern_2? [duplicate]

This question already has an answer here:
Find 'word' not followed by a certain character
(1 answer)
Closed 3 years ago.
I'd like to write a function that matches in a string and returns all uppercase letters that are not followed by a lowercase letter.
def find_sub_string(text, pattern):
    if re.search(pattern, text):
        return re.findall(pattern, text)
    else:
        return None
Here is an example of a desired output:
>>> find_sub_string('AaAcDvFF(G)F.E)f', pattern)
['F', 'F', 'G', 'F', 'E']
What should be the right pattern in that case?
I tried pattern='[A-Z][^a-z]', but it produces the following output: ['FF', 'G)', 'F.', 'E)']. This satisfies the described condition, except that it returns two characters instead of one.
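Use a negative lookahead in the regex:
import re

def find_sub_string(text, pattern):
    # Return all matches, or None when the pattern does not occur at all
    if re.search(pattern, text):
        return re.findall(pattern, text)
    else:
        return None

# [A-Z] matches an uppercase letter; (?![a-z]) checks that no lowercase
# letter follows, without consuming the next character
print(find_sub_string('AaAcDvFF(G)F.E)f', '[A-Z](?![a-z])'))
# ['F', 'F', 'G', 'F', 'E']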
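For comparison, here is a short sketch that runs the pattern from the question and the lookahead pattern side by side. It also shows a second difference: [^a-z] needs an actual character to consume, so an uppercase letter at the very end of the string is missed, while the lookahead version still matches it. The sample string 'abC' is made up for illustration.

```python
import re

text = 'AaAcDvFF(G)F.E)f'

# [^a-z] consumes the following character, so it ends up in the match
consuming = re.findall(r'[A-Z][^a-z]', text)
# (?![a-z]) only inspects the following character without consuming it
lookahead = re.findall(r'[A-Z](?![a-z])', text)

print(consuming)  # ['FF', 'G)', 'F.', 'E)']
print(lookahead)  # ['F', 'F', 'G', 'F', 'E']

# An uppercase letter at the end of the string
print(re.findall(r'[A-Z][^a-z]', 'abC'))     # []
print(re.findall(r'[A-Z](?![a-z])', 'abC'))  # ['C']
```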

[] followed by () in regex altering the meaning of [] in python

My regex expression is re.findall("[2]*(.)","b = 2 + a*10");
Its output: ['b', ' ', '=', ' ', ' ', '+', ' ', 'a', '*', '1', '0']
From the expression, I would infer that it should return all strings consisting of zero or more 2s followed by any character, which should include every character, 2 among them. But there is no 2 in the output. It actually omits the characters inside [], which I confirmed by replacing 2 with other characters, but I can't understand why. Why does [] followed by () omit the characters inside []?
Read the docs for re.findall:
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
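So when you include (.) in your pattern, re.findall will return only the contents of that group. The leading [2]* still matches (and consumes) each 2, but because it lies outside the group it is not part of what is returned.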
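A quick way to check this is to run the same pattern with the capturing group, without any group, and with a non-capturing group (?:...). This sketch uses only the string from the question:

```python
import re

text = "b = 2 + a*10"

# With a capturing group, findall returns only what the group matched
group_only = re.findall(r"[2]*(.)", text)
# Without a group, findall returns the whole match, including the 2
whole_match = re.findall(r"[2]*.", text)
# A non-capturing group behaves like no group at all
non_capturing = re.findall(r"[2]*(?:.)", text)

print(group_only)     # ['b', ' ', '=', ' ', ' ', '+', ' ', 'a', '*', '1', '0']
print(whole_match)    # ['b', ' ', '=', ' ', '2 ', '+', ' ', 'a', '*', '1', '0']
print(non_capturing)  # same as whole_match
```

In the whole-match output the 2 shows up together with the space after it, because [2]* consumed the 2 and . consumed the following character.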

Finding consecutive consonants in a word

I need code that will show me the consecutive consonants in a word. For example, for "concertation" I need to obtain ["c","nc","rt","t","n"].
Here is my code:
def SuiteConsonnes(mot):
    consonnes=[]
    for x in mot:
        if x in "bcdfghjklmnprstvyz":
            consonnes += x + ''
    return consonnes
I manage to find the consonants, but I don't see how to find them consecutively. Can anybody tell me what I need to do?
You can use regular expressions, implemented in the re module.
Better solution
>>> import re
>>> re.findall(r'[bcdfghjklmnpqrstvwxyz]+', "concertation", re.IGNORECASE)
['c', 'nc', 'rt', 't', 'n']
[bcdfghjklmnpqrstvwxyz]+ matches any sequence of one or more characters from the character class.
re.IGNORECASE enables a case-insensitive match on the characters. That is:
>>> re.findall(r'[bcdfghjklmnpqrstvwxyz]+', "CONCERTATION", re.IGNORECASE)
['C', 'NC', 'RT', 'T', 'N']
Another Solution
>>> import re
>>> re.findall(r'[^aeiou]+', "concertation")
['c', 'nc', 'rt', 't', 'n']
[^aeiou] is a negated character class. It matches any character other than the ones listed in the class. In short, it matches the consonants in the string.
The + quantifier matches one or more occurrences of the pattern in the string.
Note: this will also pick up non-alphabetic characters adjacent to the consonants, because the character class is "anything other than a vowel".
Example
>>> re.findall(r'[^aeiou]+', "123concertation")
['123c', 'nc', 'rt', 't', 'n']
If you are sure that the input always contains only letters, this solution is OK.
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings.
The string is scanned left-to-right, and matches are returned in the order found.
If you are curious about how the result is obtained for
re.findall(r'[bcdfghjklmnpqrstvwxyz]+', "concertation")
concertation
|
c

concertation
 |
 # o is not present in the character class. Matching ends here. Adds match 'c' to output list

concertation
  |
  n

concertation
   |
   c

concertation
    |
    # Match ends again. Adds match 'nc' to list

# And so on
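The same walk-through can be reproduced with re.finditer, which exposes where each match starts and ends. A minimal sketch (the variable name spans is made up for illustration):

```python
import re

word = "concertation"
pattern = re.compile(r'[bcdfghjklmnpqrstvwxyz]+')

# Each tuple is (start index, end index, matched text)
spans = [(m.start(), m.end(), m.group()) for m in pattern.finditer(word)]
print(spans)
# [(0, 1, 'c'), (2, 4, 'nc'), (5, 7, 'rt'), (8, 9, 't'), (11, 12, 'n')]
```

The gaps between one match's end and the next match's start are exactly the vowels where matching stopped.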
You could do this with regular expressions and the re module's split function:
>>> import re
>>> re.split(r"[aeiou]+", "concertation", flags=re.I)
['c', 'nc', 'rt', 't', 'n']
This method splits the string whenever one or more consecutive vowels are matched.
To explain the regular expression "[aeiou]+": here the vowels have been collected into a class [aeiou] while the + indicates that one or more occurrence of any character in this class can be matched. Hence the string "concertation" is split at o, e, a and io.
The re.I flag means that the case of the letters will be ignored, effectively making the character class equal to [aAeEiIoOuU].
Edit: One thing to keep in mind is that this method implicitly assumes that the word contains only vowels and consonants. Numbers and punctuation will be treated as non-vowels/consonants. To match only consecutive consonants, instead use re.findall with the consonants listed in the character class (as noted in other answers).
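Another thing to watch for with re.split: when the word begins or ends with a vowel, the result contains an empty string at that end. A small sketch that drops those empty pieces (the function name consonant_runs_by_split and the sample word "education" are made up for illustration):

```python
import re

def consonant_runs_by_split(word):
    # Split on runs of vowels and discard empty pieces at the edges
    return [part for part in re.split(r"[aeiou]+", word, flags=re.I) if part]

print(re.split(r"[aeiou]+", "education", flags=re.I))
# ['', 'd', 'c', 't', 'n']
print(consonant_runs_by_split("education"))
# ['d', 'c', 't', 'n']
```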
One useful shortcut to typing out all the consonants is to use the third-party regex module instead of re.
This module supports set operations, so the character class containing the consonants can be neatly written as the entire alphabet minus the vowels:
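[[a-z]--[aeiou]] # equal to [bcdfghjklmnpqrstvwxyz]
Where [a-z] is the entire alphabet, -- is set difference and [aeiou] are the vowels. Note that the regex module only enables nested sets and set operations in its version 1 behaviour, so the pattern needs the regex.V1 flag (or an inline (?V1) prefix).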
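If installing a third-party module is not an option, a similar "alphabet minus vowels" effect can be approximated in the standard re module with a negative lookahead in front of the letter class. This is a swapped-in technique, not the set-difference syntax itself, and the function name consonant_runs is hypothetical:

```python
import re

# (?![aeiou]) rejects vowels, [a-z] then accepts any remaining letter;
# the group is non-capturing so findall returns the whole run
CONSONANT_RUN = re.compile(r'(?:(?![aeiou])[a-z])+', re.IGNORECASE)

def consonant_runs(text):
    return CONSONANT_RUN.findall(text)

print(consonant_runs("123concertation"))  # ['c', 'nc', 'rt', 't', 'n']
print(consonant_runs("CONCERTATION"))     # ['C', 'NC', 'RT', 'T', 'N']
```

Unlike [^aeiou]+, digits and punctuation are not picked up, because only characters in [a-z] can match.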
If you are up for a non-regex solution, itertools.groupby would work perfectly fine here, like this
>>> from itertools import groupby
>>> is_vowel = lambda char: char in "aAeEiIoOuU"
>>> def suiteConsonnes(in_str):
...     return ["".join(g) for v, g in groupby(in_str, key=is_vowel) if not v]
...
>>> suiteConsonnes("concertation")
['c', 'nc', 'rt', 't', 'n']
A really, really simple solution without importing anything is to replace the vowels with a single thing, then split on that thing:
def SuiteConsonnes(mot):
    # Replace each vowel with "0", then split on "0" and drop the empty pieces
    consonnes = ''.join([l if l not in "aeiou" else "0" for l in mot])
    return [c for c in consonnes.split("0") if c]
To keep it really similar to your code - and to add generators - we get this:
def SuiteConsonnes(mot):
    consonnes=[]
    for x in mot:
        if x in "bcdfghjklmnprstvyz":
            consonnes.append(x)
        elif consonnes:
            yield ''.join(consonnes)
            consonnes = []
    if consonnes:
        yield ''.join(consonnes)
def SuiteConsonnes(mot):
    consonnes=[]
    consecutive = '' # initialize consecutive string of consonants
    for x in mot:
        if x in "aeiou": # checks if x is a vowel
            if consecutive: # checks if consecutive string is not empty
                consonnes.append(consecutive) # append consecutive string to consonnes
                consecutive = '' # reinitialize consecutive for another consecutive string of consonants
        else:
            consecutive += x # add x to consecutive string if x is a consonant or not a vowel
    if consecutive: # checks if consecutive string is not empty
        consonnes.append(consecutive) # append last consecutive string of consonants
    return consonnes
SuiteConsonnes('concertation')
#['c', 'nc', 'rt', 't', 'n']
Not that I'd recommend it for readability, but a one-line solution is:
In [250]: q = "concertation"
In [251]: [s for s in ''.join([l if l not in 'aeiou' else ' ' for l in q]).split()]
Out[251]: ['c', 'nc', 'rt', 't', 'n']
That is: replace each vowel with a space, join the characters back into one string, and split it on whitespace.
Use regular expressions from the built-in re module:
import re
def find_consonants(string):
    # find all non-vowels occurring 1 or more times:
    return re.findall(r'[^aeiou]+', string)
Although I think you should go with @nu11p01n73R's answer, this will also work:
re.sub('[AaEeIiOoUu]+',' ','concertation').split()

Regex Group Matching

I am trying to match a pattern like so:
Pattern: (abc)(def)(ghi)h
Match:
Group 0 = [a,b,c]
Group 1 = [d,e,f]
Group 2 = [g,h,i]
Group 3 = h
Is it possible via regex to extract the data into a list like the one described?
The code being used is Python for reference.
AFAIK, that's not possible in one regex. You could do something like this:
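import re
# Each match is a run of characters other than ( and )
matches = re.findall('[^()]+', '(abc)(def)(ghi)h')
groups = []  # renamed from "map" to avoid shadowing the built-in
for m in matches:
    groups.append(list(m))
for e in groups:
    print(e)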
which will print:
['a', 'b', 'c']
['d', 'e', 'f']
['g', 'h', 'i']
['h']
EDIT
The pattern [^()] matches any character other than a ( and ), so [^()]+ matches one or more characters other than ( and ).
Everything between a [ and ] is called a character class, and will always match just a single character. The ^ at the start makes it a negated character class (matches everything-but what is defined in it).
More info about character classes: http://www.regular-expressions.info/charclass.html
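Note that [^()]+ cannot tell a parenthesized group from bare text, so the trailing h comes back as ['h'] rather than h. If that distinction matters, one possible sketch uses an alternation with two capture groups, one for text inside parentheses and one for bare text. The function name parse_groups is made up for illustration, and nested parentheses are not handled:

```python
import re

def parse_groups(text):
    result = []
    # Group 1: contents of a (...) pair; group 2: a run of bare characters.
    # findall yields a tuple per match, with '' for the group that did not match.
    for inside, bare in re.findall(r'\(([^()]*)\)|([^()]+)', text):
        if bare:
            result.append(bare)          # bare text stays a string
        else:
            result.append(list(inside))  # parenthesized text becomes a list
    return result

print(parse_groups('(abc)(def)(ghi)h'))
# [['a', 'b', 'c'], ['d', 'e', 'f'], ['g', 'h', 'i'], 'h']
```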
