[] followed by () in regex altering the meaning of [] in python - python

My regex expression is re.findall("[2]*(.)","b = 2 + a*10");
Its output: ['b', ' ', '=', ' ', ' ', '+', ' ', 'a', '*', '1', '0']
From the expression, I infer that it should match zero or more occurrences of 2 followed by any character, which should return every character, including 2. But there is no 2 in the output. It is actually omitting the characters inside [], which I confirmed by replacing 2 with other characters. I cannot understand why this happens: why does [] followed by () omit the characters inside []?

Read the docs for re.findall:
If one or more groups are present in the pattern, return a
list of groups; this will be a list of tuples if the pattern
has more than one group.
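So when you include (.) in your pattern, re.findall returns only the contents of that group, not the whole match. The 2 is still matched by [2]*, but it lies outside the group, so it is dropped from the result and only the character after it (the space) is returned.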
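To make the difference visible, here is a minimal sketch that runs the question's pattern with a capturing group, with no group, and with a non-capturing group. The variable names are only illustrative.

```python
import re

text = "b = 2 + a*10"

# With a capturing group, findall returns only what the group captured.
captured = re.findall(r"[2]*(.)", text)

# Without a group, findall returns each whole match, so "2" is kept
# together with the character that follows it.
whole = re.findall(r"[2]*.", text)

# A non-capturing group (?:...) groups without changing what is returned.
non_capturing = re.findall(r"[2]*(?:.)", text)

print(captured)
print(whole)
print(non_capturing)
```

If the whole match is what you want, dropping the parentheses or switching to (?:...) is usually enough.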

Related

Recreating the strip() method using list comprehensions but output returns unexpected result

I am trying to 'recreate' the str.strip() method in Python for fun.
def ourveryownstrip(string, character):
    newString = [n for n in string if n != character]
    return newString
print("The string is", ourveryownstrip(input("Enter a string.`n"), input("enter character to remove`n")))
The way it works is that I create a function that takes two arguments: 1) the first is the supplied string, 2) the second is a string or character that the person wants removed from the string. Then I use a list comprehension with a conditional to store the 'modified' string as a new list, and the function returns that list.
The output, however, is the entire input as a list, with every character of the string as a separate element.
Expected output:
Boeing 747 Boeing 787
enter character to removeBoeing
The string is ['B', 'o', 'e', 'i', 'n', 'g', ' ', '7', '4', '7', ' ', 'B', 'o', 'e', 'i', 'n', 'g', ' ', '7', '8', '7']
How can I fix this?
Your comprehension compares each individual character of the string against 'Boeing'. A single character is never equal to 'Boeing', so nothing is filtered out and the whole input comes back. It is returned as a list because a list comprehension always builds a list. As @BrutusForcus said, this can be solved using string slicing and the str.index() method, like this:
def ourveryownstrip(string,character):
    while character in string:
        string = string[:string.index(character)] + string[string.index(character)+len(character):]
    return string
This will first check if the value you want removed is in your string. If it is then string[:string.index(character)] will get all of the string before the first occurrence of the character variable value and string[string.index(character)+len(character):] will get everything in the string after the first occurrence of the variable value. That will keep happening until the variable value doesn't occur in the string anymore.
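As a rough sketch of the same loop, the version below calls index() only once per pass and guards against an empty search string, which would otherwise loop forever. The name remove_all is hypothetical, and the comparison with str.replace is an addition, not part of the answer.

```python
def remove_all(string, character):
    """Remove every occurrence of `character` (a substring) from `string`."""
    if not character:
        # An empty substring is "in" every string, so the loop would never end.
        return string
    while character in string:
        start = string.index(character)
        string = string[:start] + string[start + len(character):]
    return string


looped = remove_all("Boeing 747 Boeing 787", "Boeing")
replaced = "Boeing 747 Boeing 787".replace("Boeing", "")
print(repr(looped))
print(repr(replaced))

# The two approaches differ when a removal creates a new occurrence:
print(repr(remove_all("aabb", "ab")))   # the loop keeps going until none is left
print(repr("aabb".replace("ab", "")))   # replace makes a single pass
```

For the common case the built-in str.replace gives the same result and is simpler. The loop only behaves differently when removing one occurrence joins the surrounding text into a new one.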

How to find characters not in parentheses

Attempting to find all occurrences of characters
string1 = '%(example_1).40s-%(example-2)_-%(example3)s_'
so that output has all occurrences of '-' '_' not in parentheses
['-', '_', '-', '_']
Do not need to care about nested parentheses
You can do this with the re module:
import re
str = '%(example_1).40s-%(example-2)_-%(example3)s_'
# remove every pair of parentheses together with its contents
tmpStr = re.sub(r'\(([^\)]+)\)', '', str)
# remove every character other than '_' and '-'
tmpStr = re.sub('[^_-]', '', tmpStr)
# and convert the result to a list
result_list = list(tmpStr)
Result
['-', '_', '-', '_']
And as Bharath shetty mentioned in a comment, do not use str as a variable name: it is not a reserved word, but it shadows Python's built-in string type.
The following will give you your output:
>>> import re
>>> str = '%(example_1).40s-%(example-2)_-%(example3)s_'
>>> print(list("".join(re.findall(r"[-_]+(?![^(]*\))", str))))
['-', '_', '-', '_']
What this does is it finds all the substrings containing '-' and/or '_' in str and not in parentheses. Since these are substrings, we get all such matching characters by joining, and splitting into a list.
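For comparison, the same result can be sketched without a regex by tracking whether the scan is currently inside parentheses. This is an alternative technique, not what either answer uses, and the function name chars_outside_parens is hypothetical.

```python
def chars_outside_parens(text, wanted="-_"):
    """Collect characters from `wanted` that are not inside parentheses."""
    depth = 0
    found = []
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth = max(depth - 1, 0)  # ignore a stray closing parenthesis
        elif depth == 0 and ch in wanted:
            found.append(ch)
    return found


string1 = '%(example_1).40s-%(example-2)_-%(example3)s_'
print(chars_outside_parens(string1))
```

Because it counts depth, this sketch would also cope with nested parentheses, which the question says it does not need.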

Finding consecutive consonants in a word

I need code that will show me the consecutive consonants in a word. For example, for "concertation" I need to obtain ["c","nc","rt","t","n"].
Here is my code:
def SuiteConsonnes(mot):
    consonnes=[]
    for x in mot:
        if x in "bcdfghjklmnprstvyz":
            consonnes += x + ''
    return consonnes
I manage to find the consonants, but I don't see how to find them consecutively. Can anybody tell me what I need to do?
You can use regular expressions, implemented in the re module.
Better solution
>>> import re
>>> re.findall(r'[bcdfghjklmnpqrstvwxyz]+', "concertation", re.IGNORECASE)
['c', 'nc', 'rt', 't', 'n']
[bcdfghjklmnpqrstvwxyz]+ matches any sequence of one or more characters from the character class.
re.IGNORECASE enables a case-insensitive match on the characters. That is:
>>> re.findall(r'[bcdfghjklmnpqrstvwxyz]+', "CONCERTATION", re.IGNORECASE)
['C', 'NC', 'RT', 'T', 'N']
Another Solution
>>> import re
>>> re.findall(r'[^aeiou]+', "concertation",)
['c', 'nc', 'rt', 't', 'n']
[^aeiou] Negated character class. Matches any character other than the ones in this character class. In short, it matches the consonants in the string.
+ The quantifier + matches one or more occurrences of the pattern in the string.
Note: This will also match adjacent non-alphabetic characters, because the character class matches anything other than the vowels.
Example
>>> re.findall(r'[^aeiou]+', "123concertation",)
['123c', 'nc', 'rt', 't', 'n']
If you are sure that the input always contains only letters, this solution is OK.
re.findall(pattern, string, flags=0)
Return all non-overlapping matches of pattern in string, as a list of strings.
The string is scanned left-to-right, and matches are returned in the order found.
If you are curious about how the result is obtained for
re.findall(r'[bcdfghjklmnpqrstvwxyz]+', "concertation")
concertation
|
c
concertation
 |
# o is not present in the character class. Matching ends here. Adds match 'c' to the output list
concertation
  |
  n
concertation
   |
   c
concertation
    |
# Match ends again. Adds match 'nc' to the list
# And so on
You could do this with regular expressions and the re module's split function:
>>> import re
>>> re.split(r"[aeiou]+", "concertation", flags=re.I)
['c', 'nc', 'rt', 't', 'n']
This method splits the string whenever one or more consecutive vowels are matched.
To explain the regular expression "[aeiou]+": the vowels are collected into a class [aeiou], while the + indicates that one or more occurrences of any character in this class can be matched. Hence the string "concertation" is split at o, e, a and io.
The re.I flag means that the case of the letters will be ignored, effectively making the character class equal to [aAeEiIoOuU].
Edit: One thing to keep in mind is that this method implicitly assumes that the word contains only vowels and consonants. Numbers and punctuation will be treated as non-vowels/consonants. To match only consecutive consonants, instead use re.findall with the consonants listed in the character class (as noted in other answers).
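The difference between the two approaches shows up on inputs that start or end with a vowel, or that contain digits. The following is a small sketch under the assumption that consonants means the 21 non-vowel ASCII letters. The helper names are illustrative only.

```python
import re

CONSONANTS = r"[bcdfghjklmnpqrstvwxyz]+"

def split_on_vowels(word):
    # Everything between runs of vowels, including empty strings at the edges.
    return re.split(r"[aeiou]+", word, flags=re.I)

def find_consonant_runs(word):
    # Only runs made of consonant letters.
    return re.findall(CONSONANTS, word, flags=re.I)

for word in ("concertation", "apple", "123concertation"):
    print(word, split_on_vowels(word), find_consonant_runs(word))
```

For "apple", re.split yields empty strings at both ends because the word begins and ends with a vowel, while re.findall returns only the consonant run.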
One useful shortcut to typing out all the consonants is to use the third-party regex module instead of re.
This module supports set operations, so the character class containing the consonants can be neatly written as the entire alphabet minus the vowels:
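[[a-z]--[aeiou]] # equal to [bcdfghjklmnpqrstvwxyz]
Where [a-z] is the entire alphabet, -- is set difference and [aeiou] are the vowels. Note that the regex module only enables nested sets and set operations in its VERSION1 mode, for example by starting the pattern with (?V1).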
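If installing a third-party module is not an option, a similar effect can be sketched with the standard library by computing the set difference in Python and building the character class from it. This swaps in plain Python sets for the regex module's -- operator, and the names below are illustrative.

```python
import re
import string

VOWELS = "aeiou"
# Set difference on plain Python sets, then back to a sorted string.
consonants = "".join(sorted(set(string.ascii_lowercase) - set(VOWELS)))
consonant_runs = re.compile(f"[{consonants}]+", re.IGNORECASE)

print(consonants)
print(consonant_runs.findall("concertation"))
```

This is safe to interpolate directly because the class contains only ASCII letters. A class built from arbitrary characters would need escaping first.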
If you are up for a non-regex solution, itertools.groupby would work perfectly fine here, like this
>>> from itertools import groupby
>>> is_vowel = lambda char: char in "aAeEiIoOuU"
>>> def suiteConsonnes(in_str):
...     return ["".join(g) for v, g in groupby(in_str, key=is_vowel) if not v]
...
>>> suiteConsonnes("concertation")
['c', 'nc', 'rt', 't', 'n']
A really simple solution without importing anything is to replace each vowel with a placeholder character, then split on that placeholder:
def SuiteConsonnes(mot):
    consonnes = ''.join([l if l not in "aeiou" else "0" for l in mot])
    return [c for c in consonnes.split("0") if c != '']
To keep it really similar to your code, and to turn it into a generator, we get this:
def SuiteConsonnes(mot):
    consonnes=[]
    for x in mot:
        if x in "bcdfghjklmnprstvyz":
            consonnes.append(x)
        elif consonnes:
            yield ''.join(consonnes)
            consonnes = []
    if consonnes: yield ''.join(consonnes)
def SuiteConsonnes(mot):
    consonnes=[]
    consecutive = '' # initialize consecutive string of consonants
    for x in mot:
        if x in "aeiou": # checks if x is a vowel
            if consecutive: # checks if consecutive string is not empty
                consonnes.append(consecutive) # append consecutive string to consonnes
                consecutive = '' # reinitialize consecutive for another consecutive string of consonants
        else:
            consecutive += x # add x to consecutive string if x is not a vowel
    if consecutive: # checks if consecutive string is not empty
        consonnes.append(consecutive) # append last consecutive string of consonants
    return consonnes
SuiteConsonnes('concertation')
#['c', 'nc', 'rt', 't', 'n']
Not that I'd recommend it for readability, but a one-line solution is:
In [250]: q = "concertation"
In [251]: [s for s in ''.join([l if l not in 'aeiou' else ' ' for l in q]).split()]
Out[251]: ['c', 'nc', 'rt', 't', 'n']
That is: replace each vowel with a space, then split on whitespace.
Use regular expressions from the built-in re module:
import re
def find_consonants(string):
    # find all non-vowels occurring 1 or more times:
    return re.findall(r'[^aeiou]+', string)
Although I think you should go with @nu11p01n73R's answer, this will also work:
re.sub('[AaEeIiOoUu]+',' ','concertation').split()

Split a string in python with spaces and punctuations mark , unicode characters , etc.

I want to split string like this:
string = '[[he (∇((comesΦf→chem,'
based on spaces, punctuation marks and also Unicode characters. That is, I expect output like the following:
out= ['[', '[', 'he',' ', '(','∇' , '(', '(', 'comes','Φ', 'f','→', 'chem',',']
I am using
re.findall(r"[\w\s\]+|[^\w\s]",String,re.unicode)
for this case, but it returned following output:
output=['[', '[', 'he',' ', '(', '\xe2', '\x88', '\x87', '(', '(', 'comes\xce', '\xa6', 'f\xe2', '\x86', '\x92', 'chem',',']
Please tell me how I can solve this problem.
Without using regexes and assuming words only contain ascii characters:
from string import ascii_letters
from itertools import groupby
LETTERS = frozenset(ascii_letters)
def is_alpha(char):
    return char in LETTERS
def split_string(text):
    for key, tokens in groupby(text, key=is_alpha):
        if key: # Found letters, join them and yield a word
            yield ''.join(tokens)
        else: # not letters, just yield the single tokens
            yield from tokens
Example result:
In [2]: list(split_string('[[he (∇((comesΦf→chem,'))
Out[2]: ['[', '[', 'he', ' ', '(', '∇', '(', '(', 'comes', 'Φ', 'f', '→', 'chem', ',']
If you are using a python version less than 3.3 you can replace yield from tokens with:
for token in tokens: yield token
If you are on python2 keep in mind that split_string accepts a unicode string.
Note that by modifying the is_alpha function you can define different kinds of grouping. For example, if you wanted to consider all Unicode letters as letters, you could use is_alpha = str.isalpha (or unicode.isalpha in Python 2):
In [3]: is_alpha = str.isalpha
In [4]: list(split_string('[[he (∇((comesΦf→chem,'))
Out[4]: ['[', '[', 'he', ' ', '(', '∇', '(', '(', 'comesΦf', '→', 'chem', ',']
Note the 'comesΦf', which was split before.
Hope it helps.
In [33]: string = '[[he (∇((comesΦf→chem,'
In [34]: re.split('\W+', string)
Out[34]: ['', 'he', 'comes', 'f', 'chem', '']
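In Python 3, where str is already Unicode, the tokenization asked for in the question can be sketched with re.findall instead of re.split. This assumes, as the groupby answer does, that words consist of ASCII letters only. The name split_tokens is hypothetical.

```python
import re

def split_tokens(text):
    # Runs of ASCII letters become one token; every other character
    # (spaces, punctuation, non-ASCII symbols) becomes its own token.
    return re.findall(r"[A-Za-z]+|[^A-Za-z]", text)

print(split_tokens('[[he (∇((comesΦf→chem,'))
```

Unlike re.split on \W+, this keeps the separators as tokens, which is what the expected output in the question shows.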

Python split a string using regex

I would like to split a string by ':' and ' ' characters. However, I would like to ignore two spaces '  ' and two colons '::'. For example,
text = "s:11011 i:11010 ::110011  :110010 d:11000"
should split into
[s,11011,i,11010,:,110011, ,110010,d,11000]
After following the Regular Expressions HOWTO on the Python website, I managed to come up with the following:
regx= re.compile('([\s:]|[^\s\s]|[^::])')
regx.split(text)
However, this does not work as intended: it splits on the colons and spaces, but it still includes every ':' and ' ' in the result.
[s,:,11011, ,i,:,11010, ,:,:,110011, , :,110010, ,d,:,11000]
How can I fix this?
EDIT: In case of a double space, I only want one space to appear.
Note this assumes that your data has format like X:101010:
>>> re.findall(r'(.+?):(.+?)\b ?',text)
[('s', '11011'), ('i', '11010'), (':', '110011'), (' ', '110010'), ('d', '11000')]
Then chain them up (in the interactive interpreter, _ holds the previous result):
>>> import itertools
>>> list(itertools.chain(*_))
['s', '11011', 'i', '11010', ':', '110011', ' ', '110010', 'd', '11000']
>>> text = "s:11011 i:11010 ::110011  :110010 d:11000"
>>> [x for x in re.split(r":(:)?|\s(\s)?", text) if x]
['s', '11011', 'i', '11010', ':', '110011', ' ', '110010', 'd', '11000']
Use the regex (?<=\d) |:(?=\d) to split:
>>> text = "s:11011 i:11010 ::110011  :110010 d:11000"
>>> result = re.split(r"(?<=\d) |:(?=\d)", text)
>>> result
['s', '11011', 'i', '11010', ':', '110011', ' ', '110010', 'd', '11000']
This will split on:
(?<=\d) a space, when there is a digit on the left. To check this I use a lookbehind assertion.
:(?=\d) a colon, when there is a digit on the right. To check this I use a lookahead assertion.
Have a look at this pattern:
([a-z\:\s])\:(\d+)
It will give you the same array you are expecting. No need to use split, just access the matches you have returned by the regex engine.
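A minimal sketch of how that pattern might be applied with re.findall, flattening the captured pairs into one list. The unnecessary backslashes before the colons are dropped here, and the input is assumed to contain two spaces before ":110010", as the question describes.

```python
import re

# Note the two spaces between "110011" and ":110010".
text = "s:11011 i:11010 ::110011  :110010 d:11000"

# Each match is a (key, value) tuple: one key character, a colon, then digits.
pairs = re.findall(r"([a-z:\s]):(\d+)", text)

# Flatten the tuples into a single list.
result = [item for pair in pairs for item in pair]
print(result)
```

Matching the key and value pairs directly avoids having to filter empty strings or None values out of a re.split result.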
Hope it helps!
