I want to find the words that appear after a keyword (which I specify and search for) and print the result. I know that I am supposed to use a regex for this, and I tried it like this:
import re
s = "hi my name is ryan, and i am new to python and would like to learn more"
m = re.search("^name: (\w+)", s)
print m.groups()
The output is just:
"is"
But I want to get all the words and punctuation that come after the word "name".
Instead of using regexes you could just (for example) separate your string with str.partition(separator) like this:
mystring = "hi my name is ryan, and i am new to python and would like to learn more"
keyword = 'name'
before_keyword, keyword, after_keyword = mystring.partition(keyword)
>>> before_keyword
'hi my '
>>> keyword
'name'
>>> after_keyword
' is ryan, and i am new to python and would like to learn more'
You have to deal with the leftover whitespace separately, though.
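A minimal sketch of the partition approach with the whitespace cleanup included, written for Python 3. The helper name text_after is hypothetical, and returning None when the keyword is missing is an assumption about the desired behaviour, not something the question specifies.

```python
def text_after(text, keyword):
    """Return the text following the first occurrence of keyword, or None."""
    _before, found, after = text.partition(keyword)
    if not found:  # partition returns an empty separator when there is no match
        return None
    return after.strip()  # drop the whitespace left around the remainder


s = "hi my name is ryan, and i am new to python and would like to learn more"
print(text_after(s, "name"))
```

Checking the separator rather than the remainder distinguishes "keyword not found" from "keyword found at the very end of the string".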
Your example will not work, but as I understand the idea:
regexp = re.compile("name(.*)$")
print regexp.search(s).group(1)
# prints " is ryan, and i am new to python and would like to learn more"
This prints everything after "name" up to the end of the line.
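The same idea as a small Python 3 helper, as a sketch only. The function name rest_of_line_after is hypothetical; escaping the keyword with re.escape and returning None when nothing matches are additions that the answer above does not include.

```python
import re


def rest_of_line_after(text, keyword):
    """Return everything after the first keyword up to the end of that line."""
    # re.escape keeps characters such as "." or "(" in the keyword literal.
    # "." does not match a newline, so (.*) stops at the end of the line.
    match = re.search(re.escape(keyword) + r"(.*)", text)
    return match.group(1) if match else None


s = "hi my name is ryan, and i am new to python and would like to learn more"
print(rest_of_line_after(s, "name"))
```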
Another alternative...
import re
m = re.search('(?<=name)(.*)', s)
print m.groups()
import re
s = "hi my name is ryan, and i am new to python and would like to learn more"
m = re.search("^name: (\w+)", s)
print m.group(1)
Instead of "^name: (\w+)" use:
"^name:(.*)"
What you have used regarding your output:
re.search("name (\w+)", s)
What you have to use (match all):
re.search("name (.*)", s)
You could simply do
s = "hi my name is ryan, and i am new to python and would like to learn more"
s.split('name')
This will split your string and return a list like this: ['hi my ', ' is ryan, and i am new to python and would like to learn more']
Depending on what you want to do, this may or may not help.
This will work for you: name\s\w+\s(\w+)
>>> s = 'hi my name is ryan, and i am new to python and would like to learn more'
>>> m = re.search('name\s\w+\s(\w+)',s)
>>> m.group(0)
'name is ryan'
>>> m.group(1)
'ryan'
Without using regex, you can
strip punctuation (consider making everything single case, including search term)
split your text into individual words
find index of searched word
get word from array (index + 1 for word after, index - 1 for word before )
Code snippet:
import string
s = 'hi my name is ryan, and i am new to python and would like to learn more'
t = 'name'
i = s.translate(string.maketrans("",""), string.punctuation).split().index(t)
print s.split()[i+1]
>> is
For multiple occurrences, you need to save multiple indices:
import string
s = 'hi my NAME is ryan, and i am new to NAME python and would like to learn more'
t = 'NAME'
il = [i for i, x in enumerate(s.translate(string.maketrans("",""), string.punctuation).split()) if x == t]
print [s.split()[x+1] for x in il]
>> ['is', 'python']
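The snippets above rely on string.maketrans("", "") and the print statement, which only exist in Python 2. A rough Python 3 equivalent is sketched below; the function name words_after is hypothetical, and unlike the original it returns the following word from the punctuation-stripped list, so "ryan," would come back as "ryan".

```python
import string


def words_after(text, keyword):
    """Return the word following each occurrence of keyword."""
    # str.maketrans with a third argument builds a table that deletes characters.
    table = str.maketrans("", "", string.punctuation)
    words = text.translate(table).split()
    # Skip the last word: nothing can follow it.
    return [words[i + 1] for i, w in enumerate(words[:-1]) if w == keyword]


s = 'hi my NAME is ryan, and i am new to NAME python and would like to learn more'
print(words_after(s, 'NAME'))
```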
I have a list l.
l = ["This is","'the first 'string","and 'it is 'good"]
I want to replace all the whitespace with "|space|" in the parts of each string that are enclosed in single quotes.
print (l)
# ["This is","'the|space|first|space|'string","and 'it|space|is|space|'good"]
I can't use a for loop inside a for loop and directly use .replace() as strings are not mutable
TypeError: 'str' object does not support item assignment
I have seen the below questions and none of them have helped me.
Replacing string element in for loop Python (3 answers)
Running replace() method in a for loop? (3 answers)
Replace strings using List Comprehensions (7 answers)
I have considered using re.sub but can't think of a suitable regular expression that does the job.
This works for me:
>>> def replace_spaces(str) :
...     parts = str.split("'")
...     for i in range(1,len(parts),2) :
...         parts[i] = parts[i].replace(' ', '|')
...     return "'".join( parts )
...
>>> [replace_spaces(s) for s in l]
['This is', "'the|first|'string", "and 'it|is|'good"]
>>>
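The answer above substitutes "|" rather than the "|space|" token the question asks for. A sketch of the same split-on-quote idea with the replacement made configurable follows; the function name replace_quoted_spaces and its replacement parameter are hypothetical, and it assumes the quotes in each string are balanced.

```python
def replace_quoted_spaces(text, replacement="|space|"):
    """Replace spaces only in the segments that sit between single quotes."""
    parts = text.split("'")
    # After splitting on the quote, odd-numbered segments were inside quotes.
    for i in range(1, len(parts), 2):
        parts[i] = parts[i].replace(" ", replacement)
    return "'".join(parts)


l = ["This is", "'the first 'string", "and 'it is 'good"]
print([replace_quoted_spaces(s) for s in l])
```

With an unmatched quote, everything after it counts as an odd segment and gets rewritten, so input with stray apostrophes would need extra handling.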
I think I have solved your replacing problem with regex. You might have to polish the given code snippet a bit more to suit your need.
If I understood the question correctly, the trick was to use a regular expression to find the right space to be replaced.
match = re.findall(r"\'(.+?)\'", k) #here k is an element in list.
Placing skeleton code for your reference:
import re
l = ["This is","'the first 'string","and 'it is 'good"]
output = []  # declare output
for k in l:
    match = re.findall(r"\'(.+?)\'", k)
    if not match:
        output.append(k)  # nothing is quoted: append k itself
    else:
        # replace the spaces only inside each quoted part of k
        for m in match:
            k = k.replace(m, m.replace(' ', '|space|'))
        output.append(k)
I haven't tested it yet, but it should work. Let me know if you face any issues with this.
Using regex text-munging :
import re
l = ["This is","'the first 'string","and 'it is 'good"]
def repl(m):
    return m.group(0).replace(r' ', '|space|')
l_new = []
for item in l:
    quote_str = r"'.+'"
    l_new.append(re.sub(quote_str, repl, item))
print(l_new)
Output:
['This is', "'the|space|first|space|'string", "and 'it|space|is|space|'good"]
Full logic is basically:
Loop through elements of l.
Find the string between single quotes. Pass that to repl function.
In the repl function, use a simple replace to swap each space for |space|.
Reference for text-munging => https://docs.python.org/3/library/re.html#text-munging
I am trying to remove all the single characters in a string
input: "This is a big car and it has a spacious seats"
my output should be:
output: "This is big car and it has spacious seats"
Here I am using the expression
import re
re.compile('\b(?<=)[a-z](?=)\b')
This matches with first single character in the string ...
Any help would be appreciated. Thanks in advance.
Edit: I have just seen that this was suggested in the comments first by Wiktor Stribiżew. Credit to him - I had not seen when this was posted.
You can also use re.sub() to automatically remove single characters (assuming you only want to remove alphabetical characters). The following will replace any occurrences of a single alphabetical character:
import re
input = "This is a big car and it has a spacious seats"
output = re.sub(r"\b[a-zA-Z]\b", "", input)
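>>> output
'This is  big car and it has  spacious seats'
Note that the spaces on either side of each removed letter are kept, so the result contains doubled spaces.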
You can learn more about inputting regex expression when replacing strings here: How to input a regex in string.replace?
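Removing a letter with re.sub leaves the spaces on both sides of it in place. One way to get the single-spaced output the question asks for is to normalise the whitespace afterwards, sketched here with a hypothetical helper name drop_single_letters. It assumes collapsing all runs of whitespace is acceptable.

```python
import re


def drop_single_letters(text):
    """Remove standalone ASCII letters and collapse the leftover whitespace."""
    without_letters = re.sub(r"\b[a-zA-Z]\b", "", text)
    # split() with no argument discards empty fields, so doubled spaces vanish.
    return " ".join(without_letters.split())


print(drop_single_letters("This is a big car and it has a spacious seats"))
```

Because \b treats an apostrophe as a boundary, the "s" in "it's" also counts as a single letter here; text with contractions would need a stricter pattern.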
Here's one way to do it by splitting the string and filtering out single length letters using len and str.isalpha:
>>> s = "1 . This is a big car and it has a spacious seats"
>>> ' '.join(i for i in s.split() if not (i.isalpha() and len(i)==1))
'1 . This is big car and it has spacious seats'
re.sub(r' \w{1} |^\w{1} | \w{1}$', ' ', input)
EDIT:
You can use:
import re
input_string = "This is a big car and it has a spacious seats"
str_without_single_chars = re.sub(r'(?:^| )\w(?:$| )', ' ', input_string).strip()
or (which, as was brought to my attention, doesn't meet the specification):
input_string = "This is a big car and it has a spacious seats"
' '.join(w for w in input_string.split() if len(w)>3)
A quick way to remove words, characters, strings or anything else between two known tags or two known characters in a string is to use re.sub to turn the delimiters into comment markers and then delete the comments, as shown below.
var = re.sub('<script>', '<!--', var)
var = re.sub('</script>', '-->', var)
#And finally
var = re.sub('<!--.*?-->', '', var)
It removes everything and works faster, better and cleaner than Beautiful Soup.
Batch files are where the "" got there beginnings and were only borrowed for use with batch and html from native C". When using all Pythonic methods with regular expressions you have to realize that Python has not altered or changed much from all regular expressions used by Machine Language so why iterate many times when a single loop can find it all as one chunk in one iteration? Do the same individually with Characters also.
var = re.sub('\[', '<!--', var)
var = re.sub('\]', '-->', var)
# And finally
var = re.sub('<!--.*?-->', '', var)  # wipes out the markers along with everything between them
And you do not need Beautiful Soup. You can also scalp data using them if you understand how this works.
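The three substitutions can also be folded into one non-greedy pattern. This is a sketch under assumptions, not the answer's own method: the helper name remove_between is hypothetical, the delimiters are treated as literal text, and re.DOTALL is added so that a span may cross line breaks. It does not handle nested delimiters, and regular expressions remain a blunt tool for real HTML.

```python
import re


def remove_between(text, start, end):
    """Delete every start...end span, delimiters included."""
    # re.escape makes delimiters such as "[" literal; ".*?" stops at the
    # nearest closing delimiter instead of the last one.
    pattern = re.escape(start) + r".*?" + re.escape(end)
    return re.sub(pattern, "", text, flags=re.DOTALL)


print(remove_between("a<script>x\ny</script>b", "<script>", "</script>"))
print(remove_between("keep [drop] this", "[", "]"))
```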
I am writing a script that introduces misspellings into a sentence. I am using the Python re module to replace the original word with the misspelling. The script looks like this:
# replacing original word by error
pattern = re.compile(r'%s' % original_word)
replace_by = r'\1' + err
modified_sentence = re.sub(pattern, replace_by, sentence, count=1)
But the problem is this will replace even if original_word was part of another word for example:
If I had
original_word = 'in'
err = 'il'
sentence = 'eating food in'
it would replace the occurrence of 'in' in eating like:
> 'eatilg food in'
I was checking in the re documentation but it doesn't give any example on how to include regex options, for example:
If my pattern is:
regex_pattern = '\b%s\b' % original_word
this would solve the problem as \b represents 'word boundary'. But it doesn't seem to work.
I tried to work around it by doing:
pattern = re.compile(r'([^\w])%s' % original_word)
but that does not work. For example :
original_word = 'to'
err = 'vo'
sentence = 'I will go tomorrow to the'
it replaces it to:
> I will go vomorrow to the
Thank you, any help appreciated
See here for an example of word boundaries in the Python re module. It looks like you were close; you just need to put it all together. The following script gives you the output you want...
import re
original_word = 'to'
err = 'vo'
sentence = 'I will go tomorrow to the'
pattern = re.compile(r'\b%s\b' % re.escape(original_word))
modified_sentence = re.sub(pattern, err, sentence, count=1)
print modified_sentence
Output --> I will go tomorrow vo the
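The likely reason the '\b%s\b' attempt in the question did not work is that, in an ordinary string literal, Python turns \b into a backspace character before re ever sees it. A small Python 3 sketch of the difference; the variable names plain and raw are only for illustration.

```python
import re

sentence = 'I will go tomorrow to the'

plain = '\bto\b'   # "\b" here is the backspace character (0x08)
raw = r'\bto\b'    # the raw string keeps the backslash, so re sees a word boundary

print(len(plain), len(raw))        # the plain pattern is two characters shorter
print(re.search(plain, sentence))  # no backspaces in the sentence, so no match
print(re.search(raw, sentence))    # matches the standalone "to"
```

This is why the accepted pattern is written with an r prefix.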
I asked a question a little while ago (Python splitting unknown string by spaces and parentheses) which worked great until I had to change my way of thinking. I have still not grasped regex so I need some help with this.
If the user types this:
new test (test1 test2 test3) test "test5 test6"
I would like it to look like the output to the variable like this:
["new", "test", "test1 test2 test3", "test", "test5 test6"]
In other words, if it is a single word separated by a space, then split it from the next word; if it is in parentheses, then keep the whole group of words in the parentheses as one item and remove the parentheses. The same goes for the quotation marks.
I currently am using this code which does not meet the above standard (From the answers in the link above):
>>> import re
>>> strs = "Hello (Test1 test2) (Hello1 hello2) other_stuff"
>>> [", ".join(x.split()) for x in re.split(r'[()]',strs) if x.strip()]
['Hello', 'Test1, test2', 'Hello1, hello2', 'other_stuff']
This works well but there is a problem, if you have this:
strs = "Hello Test (Test1 test2) (Hello1 hello2) other_stuff"
It combines the Hello and Test as one split instead of two.
It also doesn't allow the use of parentheses and quotation marks splitting at the same time.
The answer was simply:
re.findall('\[[^\]]*\]|\([^\)]*\)|\"[^\"]*\"|\S+',strs)
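That pattern keeps the surrounding parentheses and quotation marks in each match. If they should be dropped, as in the desired output from the question, one option is to capture the inside of each group. This is a sketch with a hypothetical function name split_groups; it assumes groups are not nested and ignores the square-bracket case.

```python
import re


def split_groups(text):
    """Split on spaces, keeping (...) and "..." contents as single items."""
    pattern = r'\(([^)]*)\)|"([^"]*)"|(\S+)'
    # With several groups, findall returns one tuple per match; exactly one
    # of the three groups is non-empty, so "or" picks it out.
    return [paren or quoted or word
            for paren, quoted, word in re.findall(pattern, text)]


print(split_groups('new test (test1 test2 test3) test "test5 test6"'))
```

An empty pair such as () yields an empty string in the result, which may or may not be what is wanted.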
This is pushing what regexps can do. Consider using pyparsing instead. It does recursive descent. For this task, you could use:
from pyparsing import *
import string, re
RawWord = Word(re.sub('[()" ]', '', string.printable))
Token = Forward()
Token << ( RawWord |
Group('"' + OneOrMore(RawWord) + '"') |
Group('(' + OneOrMore(Token) + ')') )
Phrase = ZeroOrMore(Token)
Phrase.parseString(s, parseAll=True)
This is robust against strange whitespace and handles nested parentheticals. It's also a bit more readable than a large regexp, and therefore easier to tweak.
I realize you've long since solved your problem, but this is one of the highest google-ranked pages for problems like this, and pyparsing is an under-known library.
Your problem is not well defined.
Your description of the rules is
In other words if it is one word seperated by a space then split it
from the next word, if it is in parentheses then split the whole group
of words in the parentheses and remove them. Same goes for the commas.
I guess with commas you mean inverted commas == quotation marks.
Then with this
strs = "Hello (Test1 test2) (Hello1 hello2) other_stuff"
you should get that
["Hello (Test1 test2) (Hello1 hello2) other_stuff"]
since everything is surrounded by inverted commas. Most probably, you want to work with no care of largest inverted commas.
I propose this, although it is a bit ugly
import re, itertools
strs = raw_input("enter a string list ")
print [ y for y in list(itertools.chain(*[re.split(r'\"(.*)\"', x)
for x in re.split(r'\((.*)\)', strs)]))
if y <> '']
gets
>>>
enter a string list here there (x y ) thereagain "there there"
['here there ', 'x y ', ' thereagain ', 'there there']
This is doing what you expect
import re, itertools
strs = raw_input("enter a string list ")
res1 = [ y for y in list(itertools.chain(*[re.split(r'\"(.*)\"', x)
for x in re.split(r'\((.*)\)', strs)]))
if y <> '']
set1 = re.search(r'\"(.*)\"', strs).groups()
set2 = re.search(r'\((.*)\)', strs).groups()
print ([k for k in res1 if k in list(set1) or k in list(set2)]
       + list(itertools.chain(*[k.split() for k in res1
                                if k not in set1 and k not in set2])))
For python 3.6 - 3.8
I had a similar question, however I like none of those answers, maybe because most of them are from 2013. So I elaborated my own solution.
import re
regex = r'\(.+?\)|".+?"|\w+'
test = 'Hello Test (Test1 test2) (Hello1 hello2) other_stuff'
result = re.findall(regex, test)
Here you are looking for three different groups:
Something that is included inside (); the parentheses must be escaped with backslashes
Something that is included inside ""
Just words
The use of ? makes your search lazy instead of greedy
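To see what the lazy quantifier changes, the parenthesis part of the pattern can be run both ways on the same test string. A short sketch; the variable names lazy and greedy are only for illustration.

```python
import re

test = 'Hello Test (Test1 test2) (Hello1 hello2) other_stuff'

# Lazy: ".+?" stops at the first closing parenthesis it can reach.
lazy = re.findall(r'\(.+?\)', test)

# Greedy: ".+" runs on to the last closing parenthesis in the string.
greedy = re.findall(r'\(.+\)', test)

print(lazy)
print(greedy)
```

The lazy version returns the two parenthesised groups separately, while the greedy version merges them into one match.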
What is the best way to search for matching words inside a string?
Right now I do something like the following:
if re.search('([h][e][l][l][o])',file_name_tmp, re.IGNORECASE):
This works, but it's slow: I have probably around 100 different regex statements searching for full words, so I'd like to combine several of them using a | separator or something similar.
>>> words = ('hello', 'good\-bye', 'red', 'blue')
>>> pattern = re.compile('(' + '|'.join(words) + ')', re.IGNORECASE)
>>> sentence = 'SAY HeLLo TO reD, good-bye to Blue.'
>>> print pattern.findall(sentence)
['HeLLo', 'reD', 'good-bye', 'Blue']
Can you try:
if 'hello' in longtext:
or
if 'HELLO' in longtext.upper():
to match hello/Hello/HELLO.
If you are trying to check 'hello' or a complete word in a string, you could also do
if 'hello' in stringToMatch:
... # Match found , do something
To find various strings, you could also use findall:
>>> toMatch = 'e3e3e3eeehellloqweweemeeeeefe'
>>> regex = re.compile("hello|me", re.IGNORECASE)
>>> print regex.findall(toMatch)
[u'me']
>>> toMatch = 'e3e3e3eeehelloqweweemeeeeefe'
>>> print regex.findall(toMatch)
[u'hello', u'me']
>>> toMatch = 'e3e3e3eeeHelLoqweweemeeeeefe'
>>> print regex.findall(toMatch)
[u'HelLo', u'me']
You say you want to search for WORDS. What is your definition of a "word"? If you are looking for "meet", do you really want to match the "meet" in "meeting"? If not, you might like to try something like this:
>>> import re
>>> query = ("meet", "lot")
>>> text = "I'll meet a lot of friends including Charlotte at the town meeting"
>>> regex = r"\b(" + "|".join(query) + r")\b"
>>> re.findall(regex, text, re.IGNORECASE)
['meet', 'lot']
>>>
The \b at each end forces it to match only at word boundaries, using re's definition of "word" -- "isn't" isn't a word, it's two words separated by an apostrophe. If you don't like that, look at the nltk package.
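Putting the pieces from these answers together, a single compiled pattern can cover a whole word list. The sketch below is written for Python 3; the function name find_words is hypothetical, and using re.escape on each word (so that entries like "good-bye" need no manual backslashes) is an addition to what the answers show.

```python
import re


def find_words(words, text):
    """Return every whole-word, case-insensitive match of any word in words."""
    alternation = "|".join(re.escape(w) for w in words)
    # (?:...) groups the alternatives without capturing, so findall
    # returns the matched text itself.
    pattern = re.compile(r"\b(?:" + alternation + r")\b", re.IGNORECASE)
    return pattern.findall(text)


text = "I'll meet a lot of friends including Charlotte at the town meeting"
print(find_words(("meet", "lot"), text))
```

When the same word list is searched repeatedly, compiling the pattern once outside the function and reusing it avoids rebuilding it on every call.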