I am trying to insert a string after a specific word in a paragraph.
For example, if the word I am looking for is King, I want to add Arthur right after it, so that every line containing King reads:
King Arthur went on a mission
instead of
King went on a mission
This example uses re.sub to replace King (when it is not already followed by Arthur) with King Arthur:
import re
s = "King went on a mission"
s = re.sub(r"\bKing\b(?!\s*Arthur)", "King Arthur", s)
print(s)
Prints:
King Arthur went on a mission
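The same substitution can be applied to a whole paragraph at once. The following is a minimal sketch; the sample paragraph is made up for illustration and is not from the question:

```python
import re

# Hypothetical multi-line paragraph, used only for illustration.
paragraph = """King went on a mission
The King returned home
King Arthur was already named here"""

# \bKing\b matches the whole word only (not "Kingdom");
# (?!\s*Arthur) skips occurrences already followed by "Arthur".
pattern = re.compile(r"\bKing\b(?!\s*Arthur)")
result = pattern.sub("King Arthur", paragraph)
print(result)
```

The third line is left unchanged because the negative lookahead rejects a King that is already followed by Arthur, so running the substitution twice does not produce "King Arthur Arthur".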
sentence = "I love the grand mother bag i bought . I love my sister's ring "
import re
regex = re.search('(\w+){2}the grand mother bag(\w+){2}', sentence)
print(regex.groups())
I expected to extract "I love" and "i bought".
Any idea where I went wrong?
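Change your regex. \w matches a single word character, not a whole word, and nothing in your pattern matches the whitespace between words, so re.search finds no match and returns None. Match two words separated by whitespace instead:
>>> re.search(r'(\w+\s+\w+)\s+the grand mother bag\s+(\w+\s+\w+)', sentence).groups()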
('I love', 'i bought')
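If the phrase or the number of words varies, the pattern can be built from parameters. This is only a sketch: the helper name words_around and its arguments are hypothetical, not part of the question or of the re module.

```python
import re

def words_around(text, phrase, n=2):
    """Return the n words before and after phrase, or None if there is no match."""
    # n whitespace-separated words: one word, then (n - 1) more.
    words = r"\w+(?:\s+\w+){%d}" % (n - 1)
    # re.escape keeps any special characters in the phrase literal.
    pattern = r"(%s)\s+%s\s+(%s)" % (words, re.escape(phrase), words)
    match = re.search(pattern, text)
    return match.groups() if match else None

sentence = "I love the grand mother bag i bought . I love my sister's ring "
print(words_around(sentence, "the grand mother bag"))
print(words_around(sentence, "a phrase that is absent"))
```

Checking the match object before calling groups() avoids the AttributeError you get when re.search returns None.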
I want to write a regex that matches a word that might not be present. I read here that I should try something like this:
import re
line = "a little boy went to the small garden and ate an apple"
res = re.findall("a (little|big) (boy|girl) went to the (?=.*\bsmall\b) garden and ate a(n?)",line)
print res
but the output of this is
[]
which is also the output if I set line to be
a little boy went to the garden and ate an apple
How do I allow for a word that may or may not exist in my text, and capture it if it exists?
First, you need to match not only the word "small" but also the space after it (or before it), so the optional part could be written as (small )?.
However, you want to capture the word only, without the space. Wrap the word and the space in a non-capturing group and capture just the word inside it: (?:(small) )?
Full example:
import re
lines = [
'a little boy went to the small garden and ate an apple',
'a little boy went to the garden and ate an apple'
]
for line in lines:
    res = re.findall(r'a (little|big) (boy|girl) went to the (?:(small) )?garden and ate a(n?)', line)
    print(res)
Output:
[('little', 'boy', 'small', 'n')]
[('little', 'boy', '', 'n')]
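As a variation, named groups with re.search make it easier to tell whether the optional word was present. This is a sketch only, and the group names (size, child, garden, article_n) are my own choice:

```python
import re

pattern = re.compile(
    r"a (?P<size>little|big) (?P<child>boy|girl) went to the "
    r"(?:(?P<garden>small) )?garden and ate a(?P<article_n>n?)"
)

lines = [
    "a little boy went to the small garden and ate an apple",
    "a little boy went to the garden and ate an apple",
]

garden_words = []
for line in lines:
    match = pattern.search(line)
    if match:
        # A group that did not take part in the match yields None.
        garden_words.append(match.group("garden"))
print(garden_words)
```

Unlike re.findall, which reports a missing optional group as an empty string, match.group returns None for it, so "absent" can be tested directly.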
I want to iterate through this list and, for each line, go through the words to find and remove some of them (internet addresses, to be precise) using regex, while keeping the lines intact.
aList = [
"being broken changes people, \nand rn im missing the old me",
"#SaifAlmazroui #troyboy621 #petr_hruby you're all missing the point",
"#News #Detroit Detroit water customer receives shutoff threat over missing 10 cents: - Theresa Braxton is a l... T.CO/CHPBRVH9WKk",
"#_EdenRodwell \ud83d\ude29\ud83d\ude29ahh I love you!! Missing u, McDonald's car park goss soon please \u2764\ufe0f\u2764\ufe0fxxxxx",
"This was my ring tone, before I decided change was good and missing a call was insignificant T.CO?BUXLVZFDWQ",
"want to go on holiday again, missing the sun\ud83d\ude29\u2600\ufe0f"
]
The code below almost does that, but it splits each line into words and prints every word on its own line:
i=0
while i<len(aList):
    for line in aList[i].split():
        line = re.sub(r"^[http](.*)\/(.*)$", "", line)
        print (line)
    i+=1
I'd love to get the same lines back, minus the internet addresses in each one:
[
"being broken changes people, \nand rn im missing the old me",
"#SaifAlmazroui #troyboy621 #petr_hruby you're all missing the point",
"#News #Detroit Detroit water customer receives shutoff threat over missing 10 cents: - Theresa Braxton is a ",
"#_EdenRodwell \ud83d\ude29\ud83d\ude29ahh I love you!! Missing u, McDonald's car park goss soon please \u2764\ufe0f\u2764\ufe0fxxxxx",
"This was my ring tone, before I decided change was good and missing a call was insignificant",
"want to go on holiday again, missing the sun\ud83d\ude29\u2600\ufe0f"
]
Thanks
From this:
re.sub(r"^[http](.*)\/(.*)$", "", line)
it looks to me as if you expect that all your URLs will be at the end of the line. In that case, try:
[re.sub('http://.*', '', s) for s in aList]
Here, http:// matches the literal text http://, and .* matches everything that follows it up to the next newline or the end of the string, so the URL and anything after it are removed.
Example
Here is your list with some URLs added:
aList = [
"being broken changes people, \nand rn im missing the old me",
"#SaifAlmazroui #troyboy621 #petr_hruby you're all missing the point",
"#News #Detroit Detroit water customer receives shutoff threat over missing 10 cents: - Theresa Braxton is a http://example.com/CHPBRVH9WKk",
"#_EdenRodwell ahh I love you!! Missing u, McDonald's car park goss soon please xxxxx",
"This was my ring tone, before I decided change was good and missing a call was insignificant http://example.com?BUXLVZFDWQ",
"want to go on holiday again, missing the sun"
]
Here is the result:
>>> [re.sub('http://.*', '', s) for s in aList]
['being broken changes people, \nand rn im missing the old me',
"#SaifAlmazroui #troyboy621 #petr_hruby you're all missing the point",
'#News #Detroit Detroit water customer receives shutoff threat over missing 10 cents: - Theresa Braxton is a ',
"#_EdenRodwell ahh I love you!! Missing u, McDonald's car park goss soon please xxxxx",
'This was my ring tone, before I decided change was good and missing a call was insignificant ',
'want to go on holiday again, missing the sun']
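If a URL can also appear in the middle of a line, a pattern that stops at the next whitespace removes only the address. The following is a sketch under the assumption that every URL starts with http:// or https:// and contains no spaces; the sample strings are made up. Note that the addresses in the original question (such as T.CO/CHPBRVH9WKk) have no http:// prefix, so neither this pattern nor the one above would match them.

```python
import re

# Hypothetical sample lines, used only for illustration.
samples = [
    "see http://example.com/abc for details",
    "no links here",
]

# https? allows both schemes; \S+ consumes the URL up to the next whitespace.
url_pattern = re.compile(r"https?://\S+")
cleaned = [url_pattern.sub("", s) for s in samples]
print(cleaned)
```

Removing a URL from the middle of a line leaves the spaces on both sides of it, so the first result contains a double space.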
Your question is a little unclear, but I think I get what you're going for:
newlist = [re.sub(r"{regex}", "", line) for line in aList]
This list comprehension iterates through the list of strings and replaces every substring that matches your regex pattern (put it in place of {regex}) with an empty string.
Side note:
Looking closer at your regex, it looks like it's not doing what you think it's doing. [http] is a character class that matches a single h, t, or p, not the string http.
I would look at this Stack Overflow post about matching URLs with a regex:
Regex to find urls in string in Python
I'm new to regex and I can't figure out how to do this:
Hello this is JURASSIC WORLD shut up Ok
[REVIEW] The movie BATMAN is awesome lol
What I need is the title of the movie. There will be only one per sentence. I have to ignore the words between [], as they will not be the title of the movie.
I thought of this:
^\w([A-Z]{2,})+
Any help would be welcome.
Thanks.
You can use negative lookarounds to ensure that the title is not within []:
\b(?<!\[)[A-Z ]{2,}(?!\])\b
\b matches a word boundary.
(?<!\[) is a negative lookbehind. It checks that the match is not preceded by [
[A-Z ]{2,} matches 2 or more uppercase letters or spaces.
(?!\]) is a negative lookahead. It ensures that the match is not followed by ]
Example
>>> string = """Hello this is JURASSIC WORLD shut up Ok
... [REVIEW] The movie BATMAN is awesome lol"""
>>> re.findall(r'\b(?<!\[)[A-Z ]{2,}(?!\])\b', string)
[' JURASSIC WORLD ', ' BATMAN ']
>>>
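The matches above include the surrounding spaces because the character class allows a space. Two possible ways to get clean titles are sketched below; the second pattern is my own variation, not part of the answer above, and it assumes the words of a title are separated by single spaces.

```python
import re

text = """Hello this is JURASSIC WORLD shut up Ok
[REVIEW] The movie BATMAN is awesome lol"""

# Option 1: keep the original pattern and strip the surrounding spaces.
loose = re.findall(r"\b(?<!\[)[A-Z ]{2,}(?!\])\b", text)
stripped = [title.strip() for title in loose]

# Option 2 (alternative pattern): one uppercase word of 2+ letters,
# optionally followed by more such words separated by single spaces.
tight = re.findall(r"(?<!\[)\b[A-Z]{2,}(?: [A-Z]{2,})*\b(?!\])", text)

print(stripped)
print(tight)
```

Both options return the same two titles for this input. The second one never matches leading or trailing spaces, so no post-processing is needed.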