I need to extract the word after the # character.
How can I do that? This is what I am trying:
text="Hello there #bob !"
user = text[text.find("#")+1:]
print(user)
output:
bob !
But the correct output should be:
bob
A regex solution for fun:
>>> import re
>>> re.findall(r'#(\w+)', '#Hello there #bob #!')
['Hello', 'bob']
>>> re.findall(r'#(\w+)', 'Hello there bob !')
[]
>>> (re.findall(r'#(\w+)', 'Hello there #bob !') or [None])[0]
'bob'
>>> print((re.findall(r'#(\w+)', 'Hello there bob !') or [None])[0])
None
The regex above captures runs of one or more word characters (letters, digits, or underscores) immediately following a '#' character, stopping at the first non-word character.
Here's a regex solution to match one or more non-whitespace characters if you want to capture a broader range of substrings:
>>> re.findall(r'#(\S+)', '#Hello there #bob #!')
['Hello', 'bob', '!']
Note that when the above regex encounters a string like #xyz#abc, it will capture xyz#abc as one result instead of xyz and abc separately. To fix that, use a negated character class that excludes both whitespace and # characters:
>>> re.findall(r'#([^\s#]+)', '#xyz#abc some other stuff')
['xyz', 'abc']
And here's a regex solution to match one or more alphabetic characters only, in case you don't want digits or anything else:
>>> re.findall(r'#([A-Za-z]+)', '#Hello there #bobv2.0 #!')
['Hello', 'bobv']
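If the tag must additionally start with a letter but may contain digits or underscores after it (an assumption beyond what was asked), a small variant:

```python
import re

# Hypothetical variant: the captured word must begin with a letter,
# so '#2fast' is skipped while '#bob2' is still captured in full.
print(re.findall(r'#([A-Za-z]\w*)', '#Hello there #bob2 #2fast #!'))
# ['Hello', 'bob2']
```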
So you want the word starting after # up to a whitespace?
user=text[text.find("#")+1:].split()[0]
print(user)
bob
EDIT: as @bgstech notes, in cases where the string does not have a "#", make a check first:
if "#" in text:
    user = text[text.find("#")+1:].split()[0]
else:
    user = "something_else_appropriate"
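A sketch of the same check using str.partition, which avoids the separate find() call (same example string assumed):

```python
# partition returns (before, sep, after); sep is '' when '#' is absent,
# so the truthiness of sep doubles as the membership check.
text = "Hello there #bob !"
before, sep, after = text.partition("#")
if sep:
    user = after.split()[0]
else:
    user = "something_else_appropriate"
print(user)  # bob
```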
Question: please debug the logic to produce the expected output
import re
text = "Hello there."
word_list = []
for word in text.split():
    tmp = re.split(r'(\W+)', word)
    word_list.extend(tmp)
print(word_list)
Output:
['Hello', 'there', '.', '']
Problem: the output needs to match the expected list below
Expected: ['Hello', 'there', '.']
First of all, the actual output you shared is not right; it is ['Hello', ' ', 'there', '.', ''] because-
\W matches anything other than a letter, digit, or underscore (equivalent to [^a-zA-Z0-9_]), so it splits your string on the space (\s) and the literal dot (.) character.
So if you want to get the expected output, you need to do some further processing, like the below-
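The effect of the capturing group can be seen side by side:

```python
import re

s = "Hello there."
# Without a capturing group the separators are discarded...
print(re.split(r'\W+', s))    # ['Hello', 'there', '']
# ...with a capturing group they are kept in the result list.
print(re.split(r'(\W+)', s))  # ['Hello', ' ', 'there', '.', '']
```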
With Earlier Code:
import re
s = "Hello there."
l = list(filter(str.strip, re.split(r"(\W+)", s)))
print(l)
With Edited code:
import re
text = "Hello there."
word_list = []
for word in text.split():
    tmp = re.split(r'(\W+)', word)
    word_list.extend(tmp)
print(list(filter(None, word_list)))
Output:
['Hello', 'there', '.']
Working Code: https://rextester.com/KWJN38243
Assuming the string is "Hello there.", the results make sense. See the re.split documentation: Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list.
You have put capturing parenthesis in the pattern, so you are splitting the string on non-word characters, and also return the characters used for splitting.
Here is the string:
Hello there.
Here is how it is split:
Hello|there|
that means you have three values: 'Hello', 'there', and an empty string '' in the last place.
The values you split on are a space and a period, so the output is the three values interleaved with the two separators:
'Hello' - space - 'there' - period - empty string
which is exactly what I get.
import re
s = "Hello there."
t = re.split(r"(\W+)", s)
print(t)
output:
['Hello', ' ', 'there', '.', '']
Further Explanation
From your question it may be that you think that because the string ends with a non-word character there would be nothing "after" it, but this is not how splitting works. Think back to CSV files (which have been around forever), and consider a CSV file like this:
date,product,qty,price
20220821,P1,10,20.00
20220821,P2,10,
The above represents a CSV file with four fields, but in the last data row the final field (which definitely exists) is missing. It would be parsed as an empty string if we split on the comma.
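Plain str.split shows the same trailing-empty-field behaviour:

```python
# Splitting the last data row on ',' keeps the empty final field,
# just as re.split keeps the trailing '' after the period.
line = "20220821,P2,10,"
print(line.split(','))  # ['20220821', 'P2', '10', '']
```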
For example we have this text:
Hello but I don't want1 this non-object word in it.
Using regular expressions, how can I extract words that must start with a letter and contain only letters or numbers? In this example I only want:
Hello but I want1 this word in it
Any help would be appreciated! Thanks!
You can use lookarounds in your regex:
>>> s = "Hello but I don't want1 this non-object word in it."
>>> print(re.findall(r'(?:(?<=\s)|(?<=^))\w+(?=[.\s]|$)', s))
['Hello', 'but', 'I', 'want1', 'this', 'word', 'in', 'it']
extract words that must start with a letter and that only have letters or
numbers in it
The alternative solution uses the re.sub function from the re module:
import re

s = "Hello but I don't want this non-object word in it."
s = re.sub(r'\s?\b[a-zA-Z]+?[^\w ][\w]+?\b', '', s)
print(s)
The output:
Hello but I want this word in it.
I already have the following regular expression:
'([A-Z0-9]{1,4}(?![A-Z0-9]))'
that meets the following requirements.
1-4 characters in length
All uppercase
Can be a mix of numbers and letters
Now Say I have this string "This is A test of a TREE, HOUSE"
result = ['T', 'A', 'TREE']
I don't want the first 'T' because it is not on its own; it is part of a word.
How would I go about modifying the re search to account for this?
Thanks
[Edit: Spelling]
You can use word boundaries \b around your pattern.
>>> import re
>>> s = 'This is A test of a TREE, HOUSE'
>>> re.findall(r'\b[A-Z0-9]{1,4}\b', s)
['A', 'TREE']
I tried to separate m's in a Python regex by using word boundaries and find them all. These m's should either have whitespace on both sides or begin/end the string:
r = re.compile("\\bm\\b")
re.findall(r, someString)
However, this method also finds m's within words like I'm since apostrophes are considered to be word boundaries. How do I write a regex that doesn't consider apostrophes as word boundaries?
I've tried this:
r = re.compile("(\\sm\\s) | (^m) | (m$)")
re.findall(r, someString)
but that just doesn't match any m. Odd.
Using lookaround assertion:
>>> import re
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "I'm a boy")
[]
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "I m a boy")
['m']
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "mama")
['m']
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "pm")
['m']
(?=...)
Matches if ... matches next, but doesn’t consume any of the
string. This is called a lookahead assertion. For example, Isaac
(?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.
(?<=...)
Matches if the current position in the string is preceded by a match
for ... that ends at the current position. This is called a positive
lookbehind assertion. (?<=abc)def will find a match in abcdef, ...
from Regular expression syntax
BTW, when using a raw string (r'this is a raw string'), you don't need to double the backslashes:
>>> r'\s' == '\\s'
True
You don't even need look-around (unless you want to capture the m without the spaces), and your second example was inches away: it was the extra spaces (fine in Python code, but significant inside a regex) that made it not match anything:
>>> re.findall(r'\sm\s|^m|m$', "I m a boy")
[' m ']
>>> re.findall(r'\sm\s|^m|m$', "mamam")
['m', 'm']
>>> re.findall(r'\sm\s|^m|m$', "mama")
['m']
>>> re.findall(r'\sm\s|^m|m$', "I'm a boy")
[]
>>> re.findall(r'\sm\s|^m|m$', "I'm a boym")
['m']
falsetru's answer is almost the equivalent of "\b except apostrophes", but not quite: it will still find matches where a boundary is missing. Using one of falsetru's examples:
>>> import re
>>> re.findall(r'(?<=\s)m(?=\s)|^m|m$', "mama")
['m']
It finds 'm', but there is no occurrence of 'm' in 'mama' that would match '\bm\b'. The first 'm' matches '\bm', but that's as close as it gets.
The regex that implements "\b without apostrophes" is shown below:
(?<=\s)m(?=\s)|^m(?=\s)|(?<=\s)m$|^m$
This will find any of the following 4 cases:
'm' with white space before and after
'm' at beginning followed by white space
'm' at end preceded by white space
'm' with nothing preceding or following it (i.e. just literally the string "m")
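A quick sanity check of the four cases, plus the two examples from above:

```python
import re

# "\b without apostrophes": m bounded by whitespace or string edges only
pattern = r'(?<=\s)m(?=\s)|^m(?=\s)|(?<=\s)m$|^m$'

print(re.findall(pattern, "I m a boy"))   # ['m']  whitespace on both sides
print(re.findall(pattern, "m then"))      # ['m']  at the beginning
print(re.findall(pattern, "then m"))      # ['m']  at the end
print(re.findall(pattern, "m"))           # ['m']  the string is just "m"
print(re.findall(pattern, "mama"))        # []     no match inside a word
print(re.findall(pattern, "I'm a boy"))   # []     apostrophe is not a boundary
```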
Here is the piece of code where I want help.
listword = ["os", "slow"]
sentence = "photos"
if any(word in sentence for word in listword):
    print("yes")
It prints yes, as "os" is present in "photos".
But I want to know whether "os" is present as a word in the string, not just as part of another word, and without converting the sentence into a list of words. Basically, I don't want the program to print yes here; it should print yes only if the string contains "os" as a whole word.
Thanks
You'd need to use regular expressions, and add \b word boundary anchors around each word when matching:
import re
if any(re.search(r'\b{}\b'.format(re.escape(word)), sentence) for word in listword):
    print('yes')
The \b boundary anchor matches at the start and end of the string (when adjacent to a word character), and anywhere there is a transition between word and non-word characters (so between a space and a letter or digit, or between punctuation and a letter or digit).
The re.escape() function ensures that all regular expression metacharacters are escaped and we match on the literal contents of word and not accidentally interpret anything in there as an expression.
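A small illustration of what re.escape() prevents (foo.bar is an invented search word, not from the question):

```python
import re

sentence = "call foo.bar here, not fooXbar"
word = "foo.bar"

# Unescaped, the '.' acts as a regex wildcard and also matches 'fooXbar'...
print(re.findall(r'\b{}\b'.format(word), sentence))             # ['foo.bar', 'fooXbar']
# ...escaped, only the literal 'foo.bar' matches.
print(re.findall(r'\b{}\b'.format(re.escape(word)), sentence))  # ['foo.bar']
```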
Demo:
>>> listword = ['foo', 'bar', 'baz']
>>> sentence = 'The quick fox jumped over the barred door'
>>> if any(re.search(r'\b{}\b'.format(re.escape(word)), sentence) for word in listword):
... print('yes')
...
>>> sentence = 'The tradition to use fake names like foo, bar or baz originated at MIT'
>>> if any(re.search(r'\b{}\b'.format(re.escape(word)), sentence) for word in listword):
... print('yes')
...
yes
By using a regular expression, you now can match case-insensitively as well:
if any(re.search(r'\b{}\b'.format(re.escape(word)), sentence, re.I) for word in listword):
    print('yes')
In this demo both 'the' and 'mit' qualify even though the case in the sentence differs:
>>> listword = ['the', 'mit']
>>> if any(re.search(r'\b{}\b'.format(re.escape(word)), sentence, re.I) for word in listword):
... print('yes')
...
yes
As people have pointed out, you can use regular expressions to split your string into a list of words. This is known as tokenization.
If regular expressions aren't working well enough for you, then I suggest having a look at NLTK, a Python natural language processing library. It contains a wide range of tokenizers that will split your string based on whitespace, punctuation, and other features that may be too tricky to capture with a regex.
Example:
>>> from nltk.tokenize import word_tokenize, wordpunct_tokenize, sent_tokenize
>>> s = '''Good muffins cost $3.88\nin New York. Please buy me
... two of them.\n\nThanks.'''
>>> wordpunct_tokenize(s)
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York', '.',
'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
>>> "buy" in wordpunct_tokenize(s)
True
This is simple and will not work if the sentence string contains commas, but still:
if any(" {0} ".format(a) in sentence for a in listword):
>>> import re
>>> sentence="photos"
>>> listword=["os","slow"]
>>> pat = r'|'.join(r'\b{0}\b'.format(re.escape(x)) for x in listword)
>>> bool(re.search(pat, sentence))
False
>>> listword=["os","slow", "photos"]
>>> pat = r'|'.join(r'\b{0}\b'.format(re.escape(x)) for x in listword)
>>> bool(re.search(pat, sentence))
True
While I especially like the tokenizer and the regular-expression solutions, I believe they are overkill for this situation, which can be solved effectively with the str.find() method.
listword = ['os', 'slow']
sentence = 'photos'
for word in listword:
    if sentence.find(word) != -1:
        print('yes')
Although this might not be the most elegant solution, it is (in my opinion) the most approachable one for people who have just started fiddling with the language.