I am trying to match words that are not inside < >.
This is the regular expression for matching words inside < >:
text = " Hi <how> is <everything> going"
pattern_neg = r'<([A-Za-z0-9_\./\\-]*)>'
m = re.findall(pattern_neg, text)
# m is ['how', 'everything']
I want the result to be ['Hi', 'is', 'going'].
Using re.split:
import re
text = " Hi <how> is <everything> going"
[s.strip() for s in re.split(r'\s*<.*?>\s*', text)]
# ['Hi', 'is', 'going']
A regular expression approach:
>>> import re
>>> re.findall(r"\b(?<!<)\w+(?!>)\b", text)
['Hi', 'is', 'going']
Here \b is a word boundary, (?<!<) is a negative lookbehind, (?!>) is a negative lookahead, and \w+ matches one or more word characters (letters, digits, or the underscore).
A naive non-regex approach (split on whitespace, then keep each word that neither starts with < nor ends with >):
>>> [word for word in text.split() if not word.startswith("<") and not word.endswith(">")]
['Hi', 'is', 'going']
To also handle the <hello how> are you case, we would need something different:
>>> text = " Hi <how> is <everything> going"
>>> re.findall(r"(?:^|\s)(?!<)([\w\s]+)(?!>)(?:\s|$)", text)
[' Hi', 'is', 'going']
>>> text = "<hello how> are you"
>>> re.findall(r"(?:^|\s)(?!<)([\w\s]+)(?!>)(?:\s|$)", text)
['are you']
Note that 'are you' now has to be split to get the individual words.
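If the individual words are what you need in both cases, one alternative (a sketch, not taken from the answers above) is to blank out every <...> segment first and then split on whitespace. The helper name words_outside_brackets is made up for illustration, and the pattern assumes the brackets are balanced and not nested.

```python
import re

def words_outside_brackets(text):
    # Replace each <...> segment with a space, then split on whitespace.
    # Assumption: brackets are balanced and never nested.
    return re.sub(r"<[^>]*>", " ", text).split()

print(words_outside_brackets(" Hi <how> is <everything> going"))
print(words_outside_brackets("<hello how> are you"))
```

Because the split happens after the bracketed text is gone, no second pass is needed to break 'are you' into words.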
As simple as it sounds, I can't think of a straightforward way of doing the below in Python.
my_string = "This is a test.\nAlso\tthis"
list_i_want = ["This", "is", "a", "test.", "\n", "Also", "this"]
I need the same behaviour as with string.split(), i.e. remove any type and number of whitespaces, but excluding the line breaks \n in which case I need it as a standalone list item.
How could I do this?
Split String using Regex findall()
import re
my_string = "This is a test.\nAlso\tthis"
my_list = re.findall(r"\S+|\n", my_string)
print(my_list)
How it works:
"\S+": "\S" matches a non-whitespace character, and "+" is a greedy quantifier, so this finds each run of non-whitespace characters, i.e. each word
"|": OR logic
"\n": matches "\n" so that it is returned in your list as well
Output:
['This', 'is', 'a', 'test.', '\n', 'Also', 'this']
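As a quick check of how the two alternatives interact, here is a small sketch with consecutive and trailing line breaks. The function name split_keep_newlines is only illustrative.

```python
import re

def split_keep_newlines(s):
    # \S+ grabs each run of non-whitespace characters;
    # \n grabs every line break as its own item.
    return re.findall(r"\S+|\n", s)

print(split_keep_newlines("a\n\nb  c\n"))
```

Each "\n" becomes a separate list item, while spaces and tabs are simply skipped because neither alternative matches them.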
Here's code that works, but it is definitely not efficient or Pythonic:
my_string = "This is a test.\nAlso\tthis"
l = my_string.splitlines()  # Split into lines
list_i_want = []
for i in l:
    list_i_want.extend(i.split())  # Add the words of this line
    list_i_want.append('\n')  # Add a newline item after each line
list_i_want.pop()  # Remove the trailing newline item
print(list_i_want)
Output:
['This', 'is', 'a', 'test.', '\n', 'Also', 'this']
I want to get the string that comes after a '%' symbol and ends before any other character (letters only, no digits or symbols).
For example:
string = 'Hi %how are %YOU786$ex doing'
It should return a list:
['how', 'you']
I tried
string = text.split()
sample = []
for i in string:
    if '%' in i:
        sample.append(i[1:index].lower())
return sample
but I don't know how to get rid of 'you786$ex'.
EDIT: I don't want to import re
You can use a regular expression.
>>> import re
>>> s = 'Hi %how are %YOU786$ex doing'
>>> re.findall('%([a-z]+)', s.lower())
['how', 'you']
This can be most easily done with re.findall():
import re
re.findall(r'%([a-z]+)', string.lower())
This returns:
['how', 'you']
Or you can use str.split() and iterate over the characters:
sample = []
for token in string.lower().split('%')[1:]:
    word = ''
    for char in token:
        if char.isalpha():
            word += char
        else:
            break
    sample.append(word)
sample would become:
['how', 'you']
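The same no-regex idea can be written more compactly with itertools.takewhile, which keeps characters only while a test holds. This is just a sketch: the function name words_after_percent is made up, and dropping empty results (for example a '%' followed directly by a space) is an assumption about what is wanted.

```python
from itertools import takewhile

def words_after_percent(s):
    words = []
    # Everything before the first '%' is irrelevant, hence [1:].
    for token in s.lower().split('%')[1:]:
        # Keep leading letters and stop at the first non-letter.
        word = ''.join(takewhile(str.isalpha, token))
        if word:  # assumption: skip a '%' that is not followed by a letter
            words.append(word)
    return words

print(words_after_percent('Hi %how are %YOU786$ex doing'))
```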
Use Regex (Regular Expressions).
First, create a Regex pattern for your task. You could use online tools to test it. See regex for your task: https://regex101.com/r/PMSvtK/1
Then just use this regex in Python:
import re
def parse_string(string):
    return re.findall(r"%([a-zA-Z]+)", string)
print(parse_string('Hi %how are %YOU786$ex doing'))
Output:
['how', 'YOU']
I want to grab each word that has a + before it
If I input the string:
word anotherword +aspecialword lameword +heythisone +test hello
I want it to return:
aspecialword heythisone test
Combine a split with a list comprehension:
>>> a = 'word anotherword +aspecialword lameword +heythisone +test hello'
>>> [i[1:] for i in a.split() if i[0] == '+']
['aspecialword', 'heythisone', 'test']
Try it like this:
>>> my_str = "word anotherword +aspecialword lameword +heythisone +test hello"
>>> " ".join(x[1:] for x in my_str.split() if x.startswith("+"))
'aspecialword heythisone test'
str.startswith(prefix[, start[, end]])
Return True if string starts with the prefix, otherwise return False. prefix can also be a tuple of prefixes to look for. With optional start, test string beginning at that position. With optional end, stop comparing string at that position.
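The tuple form mentioned in the documentation excerpt is handy if more than one marker should count. A minimal sketch, assuming single-character markers ('#' is an invented second marker, and words_with_prefix is an illustrative name):

```python
def words_with_prefix(text, prefixes=("+", "#")):
    # str.startswith accepts a tuple and returns True if any prefix matches.
    # Assumption: every prefix is one character long, so w[1:] removes it.
    return [w[1:] for w in text.split() if w.startswith(prefixes)]

print(words_with_prefix("word +aspecialword lameword #tagged +test hello"))
```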
You could use a regular expression.
>>> import re
>>> re.findall(r'(?<=\+)\S+', "word anotherword +aspecialword lameword +heythisone +test hello")
['aspecialword', 'heythisone', 'test']
r'(?<=\+)\S+' matches any sequence of non-space characters that are preceded by a plus sign.
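One thing to be aware of: the lookbehind only checks the single character before the match, so a plus sign in the middle of a token (as in a+b) also counts. If only words that start with + should match, a stricter pattern can be sketched as follows; the function name plus_words is illustrative.

```python
import re

def plus_words(text):
    # (?<!\S) requires that the '+' is not preceded by a non-space character,
    # i.e. it sits at the start of the string or right after whitespace.
    return re.findall(r"(?<!\S)\+(\S+)", text)

s = "a+b +c"
print(re.findall(r"(?<=\+)\S+", s))  # lookbehind version: also picks up 'b'
print(plus_words(s))                 # stricter version: only 'c'
```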
I have a list as shown below:
exclude = ["please", "hi", "team"]
I have a string as follows:
text = "Hi team, please help me out."
I want my string to look as:
text = ", help me out."
effectively stripping out any word that might appear in the list exclude
I tried the below:
if any(e in text.lower()) for e in exclude:
    print text.lower().strip(e)
But the above if statement returns a boolean value and hence I get the below error:
NameError: name 'e' is not defined
How do I get this done?
Something like this?
>>> from string import punctuation
>>> ' '.join(x for x in (word.strip(punctuation) for word in text.split())
...          if x.lower() not in exclude)
'help me out'
If you want to keep the trailing/leading punctuation with the words that are not present in exclude:
>>> ' '.join(word for word in text.split()
...          if word.strip(punctuation).lower() not in exclude)
'help me out.'
The first one is equivalent to:
>>> out = []
>>> for word in text.split():
...     word = word.strip(punctuation)
...     if word.lower() not in exclude:
...         out.append(word)
...
>>> ' '.join(out)
'help me out'
You can use this (remember that it is case-sensitive, and that str.replace also replaces matches inside longer words):
for word in exclude:
    text = text.replace(word, "")
This replaces with a space every non-word character (\W) and everything that belongs to the stopwords list, then splits the result into the words you want to keep. Finally, the list is joined back into a string with single spaces between the words. Note: it is case-sensitive.
' '.join(re.sub(r'\W|' + '|'.join(stopwords), ' ', sentence).split())
Example usage:
>>> import re
>>> stopwords=['please','hi','team']
>>> sentence='hi team, please help me out.'
>>> ' '.join(re.sub(r'\W|' + '|'.join(stopwords), ' ', sentence).split())
'help me out'
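Since the alternation above is case-sensitive and would also hit stopwords inside longer words (the hi in this, for example), here is a hedged variant that uses word boundaries and re.IGNORECASE and leaves punctuation in place. The function name remove_words is made up, and collapsing the leftover whitespace is an assumption about the desired output.

```python
import re

def remove_words(text, words):
    # \b...\b restricts matches to whole words; re.escape guards against
    # regex metacharacters in the word list.
    pattern = r"\b(?:" + "|".join(map(re.escape, words)) + r")\b"
    stripped = re.sub(pattern, "", text, flags=re.IGNORECASE)
    # Assumption: runs of whitespace left behind should collapse to one space.
    return re.sub(r"\s+", " ", stripped).strip()

print(remove_words("Hi team, please help me out.", ["please", "hi", "team"]))
```

With the question's input this gives ', help me out.', which is the string the question asked for.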
Using simple methods:
import re
exclude = ["please", "hi", "team"]
text = "Hi team, please help me out."
l = []
te = re.findall(r"\w*", text)  # words, plus empty matches between them
for a in te:
    # keep non-empty words that are not in exclude (case-insensitive)
    if a and a.upper() not in (name.upper() for name in exclude):
        l.append(a)
print(" ".join(l))
Hope it helps
If you are not worried about punctuation:
>>> import re
>>> text = "Hi team, please help me out."
>>> text = re.findall(r"\w+", text)
>>> text
['Hi', 'team', 'please', 'help', 'me', 'out']
>>> " ".join(x for x in text if x.lower() not in exclude)
'help me out'
In the above code, re.findall finds all the words and puts them in a list.
\w matches a word character: A-Z, a-z, 0-9, and the underscore
+ means one or more occurrences
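A small sketch of what \w covers in practice. In Python 3, str patterns are Unicode-aware by default, so \w also matches accented letters unless re.ASCII is passed; the sample string is invented for illustration.

```python
import re

sample = "snake_case caf\u00e9 42"  # "café" written with an escape for clarity

# Default (Unicode) matching: the underscore and the accented letter count as word characters.
unicode_words = re.findall(r"\w+", sample)

# With re.ASCII, \w is limited to [A-Za-z0-9_], so the accented letter ends the word.
ascii_words = re.findall(r"\w+", sample, flags=re.ASCII)

print(unicode_words)
print(ascii_words)
```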
I have a list with elements that have unnecessary (non-alphanumeric) characters at the beginning or end of each string.
Ex.
'cats--'
I want to get rid of the --
I tried:
for i in thelist:
    newlist.append(i.strip('\W'))
That didn't work. Any suggestions?
def strip_nonalnum(word):
    if not word:
        return word  # nothing to strip
    for start, c in enumerate(word):
        if c.isalnum():
            break
    else:
        return ""  # no alphanumeric characters at all
    for end, c in enumerate(word[::-1]):
        if c.isalnum():
            break
    return word[start:len(word) - end]
print([strip_nonalnum(s) for s in thelist])
Or
import re
def strip_nonalnum_re(word):
    return re.sub(r"^\W+|\W+$", "", word)
To remove one or more chars other than letters, digits and _ from both ends you may use
re.sub(r'^\W+|\W+$', '', '??cats--') # => cats
Or, if _ is to be removed, too, wrap \W into a character class and add _ there:
re.sub(r'^[\W_]+|[\W_]+$', '', '_??cats--_')
See the Python demo:
import re
print( re.sub(r'^\W+|\W+$', '', '??cats--') ) # => cats
print( re.sub(r'^[\W_]+|[\W_]+$', '', '_??cats--_') ) # => cats
You can use a regular expression. The method re.sub() takes three parameters:
The regex expression
The replacement
The string
Code:
import re
s = 'cats--'
output = re.sub("[^\\w]", "", s)
print(output)
Explanation:
The part "\\w" matches any word character (a letter, a digit, or the underscore).
[^x] will match any character that is not x
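Note that a pattern like [^\w] on its own removes non-word characters everywhere in the string, not only at the ends. The sketch below contrasts it with an anchored pattern; the two function names are illustrative only.

```python
import re

def remove_everywhere(s):
    # No anchors: every non-word character is dropped, wherever it occurs.
    return re.sub(r"[^\w]", "", s)

def remove_at_ends(s):
    # ^ and $ anchor the runs of non-word characters to the two ends.
    return re.sub(r"^\W+|\W+$", "", s)

s = "--the#!cats-%"
print(remove_everywhere(s))
print(remove_at_ends(s))
```

For 'cats--' both give 'cats'; they only differ when unwanted characters sit in the middle of the string.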
I believe that this is the shortest non-regex solution:
text = "`23`12foo--=+"
while len(text) > 0 and not text[0].isalnum():
    text = text[1:]
while len(text) > 0 and not text[-1].isalnum():
    text = text[:-1]
print(text)
With strip, you have to know which characters are to be stripped.
>>> 'cats--'.strip('-')
'cats'
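This is also why strip('\W') does nothing useful: str.strip treats its argument as a set of literal characters, not as a regex. If the unwanted characters are ordinary ASCII punctuation, one option is to pass string.punctuation as that set. A sketch under that assumption (strip_punct is a made-up name):

```python
import string

def strip_punct(word):
    # str.strip removes any of the given characters from both ends only.
    # Assumption: the junk is ASCII punctuation or whitespace.
    return word.strip(string.punctuation + string.whitespace)

print([strip_punct(w) for w in ["cats--", "??cats--", "_??cats--_", "the#!cats-%"]])
```

Characters in the middle of a word are left alone, and non-ASCII punctuation would not be removed.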
You could use re to get rid of the non-alphanumeric characters, but that would be shooting at a mouse with a cannon, IMO. With str.isalpha() you can test whether a character is alphabetic, so you only need to keep the characters that pass:
>>> ''.join(char for char in '#!cats-%' if char.isalpha())
'cats'
>>> thelist = ['cats5--', '#!cats-%', '--the#!cats-%', '--5cats-%', '--5!cats-%']
>>> [''.join(c for c in e if c.isalpha()) for e in thelist]
['cats', 'cats', 'thecats', 'cats', 'cats']
You want to get rid of non-alphanumeric so we can make this better:
>>> [''.join(c for c in e if c.isalnum()) for e in thelist]
['cats5', 'cats', 'thecats', '5cats', '5cats']
This is exactly the same result you would get with re (as in Christian's answer):
>>> import re
>>> [re.sub("[^\\w]", "", e) for e in thelist]
['cats5', 'cats', 'thecats', '5cats', '5cats']
However, if you want to strip non-alphanumeric characters from the ends of the strings only, you should use another pattern like this one (check the re documentation):
>>> [''.join(re.search(r'^\W*(.+)(?!\W*$)(.)', e).groups()) for e in thelist]
['cats5', 'cats', 'the#!cats', '5cats', '5!cats']