Python: Grab each word after certain character in string - python

I want to grab each word that has a + before it
If I input the string:
word anotherword +aspecialword lameword +heythisone +test hello
I want it to return:
aspecialword heythisone test

Use a split combined with a list comprehension:
>>> a = 'word anotherword +aspecialword lameword +heythisone +test hello'
>>> [i[1:] for i in a.split() if i[0] == '+']
['aspecialword', 'heythisone', 'test']

Try it like this:
>>> my_str = "word anotherword +aspecialword lameword +heythisone +test hello"
>>> " ".join(x[1:] for x in my_str.split() if x.startswith("+"))
'aspecialword heythisone test'
str.startswith(prefix[, start[, end]])
Return True if string starts with the prefix, otherwise return False. prefix can also be a tuple of prefixes to look for. With optional start, test string beginning at that position. With optional end, stop comparing string at that position.
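As a small illustration of the documentation quoted above, the sketch below uses a tuple of prefixes and the optional start argument. The sample tokens and the choice of '#' as a second marker are made up for the example and are not part of the question.

```python
tokens = "word +plus #hash -minus x+y".split()

# A tuple of prefixes: keep tokens starting with '+' or '#', dropping the marker.
marked = [t[1:] for t in tokens if t.startswith(("+", "#"))]
print(marked)

# Optional start argument: test for '+' at index 1 instead of index 0.
print("x+y".startswith("+", 1))
```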

You could use a regular expression.
>>> import re
>>> re.findall(r'(?<=\+)\S+', "word anotherword +aspecialword lameword +heythisone +test hello")
['aspecialword', 'heythisone', 'test']
r'(?<=\+)\S+' matches any sequence of non-space characters that are preceded by a plus sign.
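One thing to keep in mind: the lookbehind only checks the character just before the match, so a plus sign in the middle of a token also counts. If only tokens that start with + should be returned, a possible variant (a sketch; the helper name words_after_plus and the sample string are invented for illustration) is to require that the + is not preceded by a non-space character:

```python
import re

def words_after_plus(text):
    # (?<!\S) holds at the start of the string or right after whitespace,
    # so only a '+' that begins a token is accepted.
    return re.findall(r'(?<!\S)\+(\S+)', text)

sample = "word a+b +aspecialword lameword +heythisone +test hello"

# The plain lookbehind also picks up the 'b' from 'a+b'.
print(re.findall(r'(?<=\+)\S+', sample))

# The stricter variant ignores 'a+b'.
print(words_after_plus(sample))
```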

Related

Get a string after a character in python

Getting a string that comes after a '%' symbol and should end before other characters (no numbers and characters).
for example:
string = 'Hi %how are %YOU786$ex doing'
it should return as a list.
['how', 'you']
I tried
string = text.split()
sample = []
for i in string:
    if '%' in i:
        sample.append(i[1:index].lower())
return sample
but I don't know how to get rid of the rest of 'you786$ex'.
EDIT: I don't want to import re
You can use a regular expression.
>>> import re
>>>
>>> s = 'Hi %how are %YOU786$ex doing'
>>> re.findall('%([a-z]+)', s.lower())
['how', 'you']
This can be most easily done with re.findall():
import re
re.findall(r'%([a-z]+)', string.lower())
This returns:
['how', 'you']
Or you can use str.split() and iterate over the characters:
sample = []
for token in string.lower().split('%')[1:]:
    word = ''
    for char in token:
        if char.isalpha():
            word += char
        else:
            break
    sample.append(word)
sample would become:
['how', 'you']
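Since the question asks for a solution without re, the same split-and-scan idea can also be written with itertools.takewhile, which stops at the first non-letter. This is only a sketch: the function name words_after_percent is made up here, and skipping empty results (for example when '%' is followed by a digit) is an assumption about the desired behaviour.

```python
from itertools import takewhile

def words_after_percent(text):
    words = []
    # Everything before the first '%' is dropped by the [1:] slice.
    for token in text.lower().split('%')[1:]:
        # Keep the leading run of letters and stop at the first other character.
        word = ''.join(takewhile(str.isalpha, token))
        if word:  # assumption: ignore a '%' not followed by a letter
            words.append(word)
    return words

print(words_after_percent('Hi %how are %YOU786$ex doing'))
```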
Use Regex (Regular Expressions).
First, create a Regex pattern for your task. You could use online tools to test it. See regex for your task: https://regex101.com/r/PMSvtK/1
Then just use this regex in Python:
import re
def parse_string(string):
    # % needs no escaping; a raw string avoids invalid-escape warnings
    return re.findall(r"%([a-zA-Z]+)", string)
print(parse_string('Hi %how are %YOU786$ex doing'))
Output:
['how', 'YOU']

Python re: if string has one word AND any one of a list of words?

I want to find if a string matches on this rule using a regular expression:
list_of_words = ['a', 'boo', 'blah']
if 'foo' in temp_string and any(word in temp_string for word in list_of_words)
The reason I want it in a regular expression is that I have hundreds of rules like it and different from it so I want to save them all as patterns in a dict.
The only one I could think of is this but it doesn't seem pretty:
re.search(r'foo.*(a|boo|blah)|(a|boo|blah).*foo', temp_string)
You can join the array elements using | to construct a lookahead assertion regex:
>>> list_of_words = ['a', 'boo', 'blah']
>>> reg = re.compile( r'^(?=.*\b(?:' + "|".join(list_of_words) + r')\b).*foo' )
>>> print(reg.pattern)
^(?=.*\b(?:a|boo|blah)\b).*foo
>>> reg.findall(r'abcd foo blah')
['abcd foo']
As you can see we have constructed a regex ^(?=.*\b(?:a|boo|blah)\b).*foo which asserts presence of one word from list_of_words and matches foo anywhere.
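Because the question mentions storing hundreds of such rules, here is a hedged sketch of wrapping the construction in a helper. The name build_rule and the sample strings are invented for illustration, and re.escape is added on the assumption that the words may contain regex metacharacters. Note that the \b anchors make the alternatives whole-word matches, which is stricter than the substring test (word in temp_string) in the question.

```python
import re

def build_rule(required, alternatives):
    # Escape each word so characters like '.' or '+' are matched literally.
    alt = "|".join(re.escape(w) for w in alternatives)
    # The lookahead asserts that one alternative occurs as a whole word
    # somewhere; the rest of the pattern requires `required` to appear.
    return re.compile(r'^(?=.*\b(?:' + alt + r')\b).*' + re.escape(required))

rule = build_rule('foo', ['a', 'boo', 'blah'])

print(bool(rule.search('abcd foo blah')))  # 'foo' and the word 'blah' present
print(bool(rule.search('abcd foo')))       # no whole-word alternative
print(bool(rule.search('blah only')))      # 'foo' missing
```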

Match words that are not inside the characters < > using regular expressions

I am trying to match words that are not inside < >.
This is the regular expression for matching words inside < >:
text = " Hi <how> is <everything> going"
pattern_neg = r'<([A-Za-z0-9_\./\\-]*)>'
m = re.findall(pattern_neg, text)
# m is ['how', 'everything']
I want the result to be ['Hi', 'is', 'going'].
Using re.split:
import re
text = " Hi <how> is <everything> going"
[s.strip() for s in re.split(r'\s*<.*?>\s*', text)]
# ['Hi', 'is', 'going']
A regular expression approach:
>>> import re
>>> re.findall(r"\b(?<!<)\w+(?!>)\b", text)
['Hi', 'is', 'going']
Where \b are the word boundaries, (?<!<) is a negative lookbehind and (?!>) a negative lookahead, \w+ would match one or more alphanumeric characters.
A naive non-regex approach (splitting on whitespace and checking that each word neither starts with < nor ends with >):
>>> [word for word in text.split() if not word.startswith("<") and not word.endswith(">")]
['Hi', 'is', 'going']
To also handle the <hello how> are you case, we would need something different:
>>> text = " Hi <how> is <everything> going"
>>> re.findall(r"(?:^|\s)(?!<)([\w\s]+)(?!>)(?:\s|$)", text)
[' Hi', 'is', 'going']
>>> text = "<hello how> are you"
>>> re.findall(r"(?:^|\s)(?!<)([\w\s]+)(?!>)(?:\s|$)", text)
['are you']
Note that 'are you' now has to be split to get the individual words.
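Another way to cover the multi-word case is to delete the bracketed spans first and split what is left. This is a sketch of a different technique from the lookaround patterns above; the helper name words_outside_brackets is made up, and it assumes the brackets are balanced and not nested.

```python
import re

def words_outside_brackets(text):
    # Replace every <...> span with a space, then split on whitespace.
    return re.sub(r'<[^>]*>', ' ', text).split()

print(words_outside_brackets(" Hi <how> is <everything> going"))
print(words_outside_brackets("<hello how> are you"))
```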

Python how to strip a string from a string based on items in a list

I have a list as shown below:
exclude = ["please", "hi", "team"]
I have a string as follows:
text = "Hi team, please help me out."
I want my string to look as:
text = ", help me out."
effectively stripping out any word that might appear in the list exclude
I tried the below:
if any(e in text.lower() for e in exclude):
    print(text.lower().strip(e))
But the above if statement returns a boolean value and hence I get the below error:
NameError: name 'e' is not defined
How do I get this done?
Something like this?
>>> from string import punctuation
>>> ' '.join(x for x in (word.strip(punctuation) for word in text.split())
...          if x.lower() not in exclude)
'help me out'
If you want to keep the trailing/leading punctuation with the words that are not present in exclude:
>>> ' '.join(word for word in text.split()
...          if word.strip(punctuation).lower() not in exclude)
'help me out.'
First one is equivalent to:
>>> out = []
>>> for word in text.split():
...     word = word.strip(punctuation)
...     if word.lower() not in exclude:
...         out.append(word)
...
>>> ' '.join(out)
'help me out'
You can use this (remember it is case sensitive, and it also removes matches inside longer words):
for word in exclude:
    text = text.replace(word, "")
This replaces everything that is not alphanumeric, or that matches an entry in the stopwords list, with a space, and then splits the result into the words you want to keep. Finally, the list is joined into a string with single spaces between the words. Note: it is case sensitive, and a stopword is also removed when it occurs inside a longer word.
' '.join(re.sub(r'\W|' + '|'.join(stopwords), ' ', sentence).split())
Example usage:
>>> import re
>>> stopwords=['please','hi','team']
>>> sentence='hi team, please help me out.'
>>> ' '.join(re.sub(r'\W|' + '|'.join(stopwords), ' ', sentence).split())
'help me out'
Using simple methods:
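import re
exclude = ["please", "hi", "team"]
text = "Hi team, please help me out."
kept = []
# [\w]* also produces empty matches, so those are skipped below
for token in re.findall(r"[\w]*", text):
    # compare case-insensitively against the exclude list
    if token and token.upper() not in (name.upper() for name in exclude):
        kept.append(token)
print(" ".join(kept))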
Hope it helps
If you are not worried about punctuation:
>>> import re
>>> text = "Hi team, please help me out."
>>> text = re.findall(r"\w+", text)
>>> text
['Hi', 'team', 'please', 'help', 'me', 'out']
>>> " ".join(x for x in text if x.lower() not in exclude)
'help me out'
In the above code, re.findall will find all words and put them in a list.
\w matches letters, digits and the underscore (A-Za-z0-9_ for ASCII text)
+ means one or more occurrences
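To get the exact output asked for in the question (", help me out."), with punctuation kept and case ignored, one option is a single substitution with word boundaries. This is a sketch rather than any of the answers above; swallowing the whitespace after each removed word is an assumption made so that the result matches the requested string.

```python
import re

exclude = ["please", "hi", "team"]
text = "Hi team, please help me out."

# \b...\b matches whole words only, so "hi" inside "this" is left alone;
# the trailing \s* also removes the whitespace after a deleted word.
pattern = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, exclude)) + r')\b\s*',
    re.IGNORECASE,
)

result = pattern.sub('', text)
print(result)
```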

How to remove non-alphanumeric characters at the beginning or end of a string

I have a list with elements that have unnecessary (non-alphanumeric) characters at the beginning or end of each string.
Ex.
'cats--'
I want to get rid of the --
I tried:
for i in thelist:
    newlist.append(i.strip('\W'))
That didn't work. Any suggestions?
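def strip_nonalnum(word):
    if not word:
        return word  # nothing to strip
    # index of the first alphanumeric character
    for start, c in enumerate(word):
        if c.isalnum():
            break
    else:
        return ''  # no alphanumeric character at all
    # number of non-alphanumeric characters at the end
    for end, c in enumerate(word[::-1]):
        if c.isalnum():
            break
    return word[start:len(word) - end]
print([strip_nonalnum(s) for s in thelist])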
Or
import re
def strip_nonalnum_re(word):
    return re.sub(r"^\W+|\W+$", "", word)
To remove one or more chars other than letters, digits and _ from both ends you may use
re.sub(r'^\W+|\W+$', '', '??cats--') # => cats
Or, if _ is to be removed, too, wrap \W into a character class and add _ there:
re.sub(r'^[\W_]+|[\W_]+$', '', '_??cats--_')
See the Python demo:
import re
print( re.sub(r'^\W+|\W+$', '', '??cats--') ) # => cats
print( re.sub(r'^[\W_]+|[\W_]+$', '', '_??cats--_') ) # => cats
You can use a regex expression. The method re.sub() will take three parameters:
The regex expression
The replacement
The string
Code:
import re
s = 'cats--'
output = re.sub("[^\\w]", "", s)
print(output)
Explanation:
The part "\\w" matches any alphanumeric character or underscore.
[^x] will match any character that is not x, so this removes every such character, not only those at the beginning or end.
I believe that this is the shortest non-regex solution:
text = "`23`12foo--=+"
while len(text) > 0 and not text[0].isalnum():
    text = text[1:]
while len(text) > 0 and not text[-1].isalnum():
    text = text[:-1]
print(text)
By using strip you have to know the substring to be stripped.
>>> 'cats--'.strip('-')
'cats'
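If the unwanted characters are limited to ASCII punctuation and whitespace, the set to strip can be taken from the string module instead of being spelled out. This is a sketch under that assumption: other symbols, such as non-ASCII punctuation, would be left in place. The sample list is made up for illustration.

```python
import string

thelist = ['cats--', '??cats--', '_??cats--_', '--the#!cats-%']

# str.strip treats its argument as a set of characters to remove from both ends.
strip_chars = string.punctuation + string.whitespace
cleaned = [s.strip(strip_chars) for s in thelist]
print(cleaned)
```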
You could use re to get rid of the non-alphanumeric characters, but that would be using a cannon to shoot a mouse, IMO. With str.isalpha() you can test whether a character is alphabetic, so you only need to keep those that are:
>>> ''.join(char for char in '#!cats-%' if char.isalpha())
'cats'
>>> thelist = ['cats5--', '#!cats-%', '--the#!cats-%', '--5cats-%', '--5!cats-%']
>>> [''.join(c for c in e if c.isalpha()) for e in thelist]
['cats', 'cats', 'thecats', 'cats', 'cats']
You want to get rid of non-alphanumeric so we can make this better:
>>> [''.join(c for c in e if c.isalnum()) for e in thelist]
['cats5', 'cats', 'thecats', '5cats', '5cats']
For this list, that is exactly the same result you would get with re (as in Christian's answer); note that \w would also keep underscores:
>>> import re
>>> [re.sub("[^\\w]", "", e) for e in thelist]
['cats5', 'cats', 'thecats', '5cats', '5cats']
However, if you want to strip non-alphanumeric characters only from the beginning and end of the strings, you should use another pattern like this one (check the re documentation):
>>> [''.join(re.search(r'^\W*(.+)(?!\W*$)(.)', e).groups()) for e in thelist]
['cats5', 'cats', 'the#!cats', '5cats', '5!cats']
