Split by regex of new line and capital letter - python

I've been struggling to split my string by a regex expression in Python.
I have a text file which I load that is in the format of:
"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch
at Kate's house. Kyle went home at 9. \nSome other sentence
here\n\u2022Here's a bulleted line"
I'd like to get the following output:
['Peter went to the gym; he worked out for two hours','Kyle ate lunch
at Kate's house. He went home at 9.', 'Some other sentence here',
'\u2022Here's a bulleted line']
I'm looking to split my string by a new line and a capital letter or a bullet point in Python.
I've tried tackling the first half of the problem, splitting my string by just a new line and capital letter.
Here's what I have so far:
print re.findall(r'\n[A-Z][a-z]+',str,re.M)
This just gives me:
[u'\nKyle', u'\nSome']
which is just the first word. I've tried variations of that regex expression but I don't know how to get the rest of the line.
I assume that to also split by the bullet point, I would just include an OR regex expression that is in the same format as the regex of splitting by a capital letter. Is this the best way?
I hope this makes sense, and I'm sorry if my question is in any way unclear. :)

You can use this split function:
>>> str = u"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch at Kate's house. Kyle went home at 9. \nSome other sentence here\n\u2022Here's a bulleted line"
>>> print re.split(u'\n(?=\u2022|[A-Z])', str)
[u'Peter went to the gym; \nhe worked out for two hours ',
u"Kyle ate lunch at Kate's house. Kyle went home at 9. ",
u'Some other sentence here',
u"\u2022Here's a bulleted line"]

You can split at each \n that is followed by a capital letter or the bullet character:
import re
s = """
Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch
at Kate's house. Kyle went home at 9. \nSome other sentence
here\n\u2022Here's a bulleted line
"""
new_list = list(filter(None, re.split('\n(?=•)|\n(?=[A-Z])', s)))
Output:
['Peter went to the gym; \nhe worked out for two hours ', "Kyle ate lunch \nat Kate's house. Kyle went home at 9. ", 'Some other sentence \nhere', "•Here's a bulleted line\n"]
Or, without using the symbol for the bullet character:
new_list = list(filter(None, re.split('\n(?=\u2022)|\n(?=[A-Z])', s)))
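Neither answer removes the line breaks that remain inside each piece (for example the "\nhe" in the first element), which the question's desired output does not contain. A minimal Python 3 sketch of one way to get closer to that output, assuming it is acceptable to collapse every run of whitespace into a single space; the function name split_sentences is made up for illustration:

```python
import re

text = (
    "Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch "
    "at Kate's house. Kyle went home at 9. \nSome other sentence "
    "here\n\u2022Here's a bulleted line"
)

def split_sentences(s):
    # Split at a newline only when a capital letter or a bullet follows it.
    parts = re.split(r'\n(?=[A-Z\u2022])', s)
    # Collapse leftover newlines and repeated spaces; drop empty pieces.
    return [' '.join(p.split()) for p in parts if p.strip()]

result = split_sentences(text)
print(result)
```

The lookahead keeps the capital letter or bullet in the following piece, so nothing is lost at the split points.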

Related

How to add a string to a paragraph after a particular word Python

I am trying to insert a string after a specific word in a paragraph.
For example:
If the word I am looking at is King in the paragraph, I want to add Arthur right after it so it will be like:
King Arthur went on a mission
for every line that contains King, instead of
King went on a mission
This example uses re.sub to replace King (when it is not already followed by Arthur) with King Arthur:
import re
s = "King went on a mission"
s = re.sub(r"\bKing\b(?!\s*Arthur)", "King Arthur", s)
print(s)
Prints:
King Arthur went on a mission
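As a rough sketch of how the same substitution behaves across several lines, the sample lines below are invented for illustration. The word boundaries keep words such as Kingdom untouched, and the negative lookahead leaves lines that already say King Arthur alone:

```python
import re

# Hypothetical sample lines, not from the question.
lines = [
    "King went on a mission",
    "King Arthur went on a mission",
    "The Kingdom was quiet",
]

# Match the whole word King only when Arthur does not already follow it.
pattern = re.compile(r"\bKing\b(?!\s*Arthur)")

updated = [pattern.sub("King Arthur", line) for line in lines]
for line in updated:
    print(line)
```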

Trying to find two words before and after a group of words with regex

sentence = "I love the grand mother bag i bought . I love my sister's ring "
import re
regex = re.search('(\w+){2}the grand mother bag(\w+){2}', sentence)
print(regex.groups())
I should have extracted: I love and I bought.
Any idea where I went wrong?
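Change your regex: \w matches a single word character, not a whole word, and your pattern never matches the spaces between the words. Match two words separated by whitespace on each side of the phrase instead:
>>> re.search(r'(\w+\s+\w+)\s+the grand mother bag\s+(\w+\s+\w+)', sentence).groups()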
('I love', 'i bought')
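If the phrase or the number of surrounding words varies, the pattern can be built from parameters. This is only a sketch: the helper name context_words and its arguments are hypothetical, and it assumes that a "word" means a run of \w characters.

```python
import re

def context_words(sentence, phrase, n=2):
    # n words in a row: one word, then (n - 1) further "whitespace + word" pairs.
    word_run = r"\w+(?:\s+\w+){%d}" % (n - 1)
    # re.escape keeps any special characters in the phrase literal.
    pattern = r"(%s)\s+%s\s+(%s)" % (word_run, re.escape(phrase), word_run)
    match = re.search(pattern, sentence)
    return match.groups() if match else None

sentence = "I love the grand mother bag i bought . I love my sister's ring "
found = context_words(sentence, "the grand mother bag")
print(found)
```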

python regex match a possible word

I want to match a regex to match a word that might not exist. I read here that I should try something like this:
import re
line = "a little boy went to the small garden and ate an apple"
res = re.findall("a (little|big) (boy|girl) went to the (?=.*\bsmall\b) garden and ate a(n?)",line)
print res
but the output of this is
[]
which is also the output if I set line to be
a little boy went to the garden and ate an apple
How do I allow for a word that may or may not appear in my text, and capture it if it does?
First, you need to match not only the word small but also the space after (or before) it, so the optional part could be written as (small )?.
However, you want to capture only the word itself. To keep the space out of the capture, put the capturing group inside an optional non-capturing group: (?:(small) )?
Full example:
import re
lines = [
'a little boy went to the small garden and ate an apple',
'a little boy went to the garden and ate an apple'
]
for line in lines:
    res = re.findall(r'a (little|big) (boy|girl) went to the (?:(small) )?garden and ate a(n?)', line)
    print(res)
Output:
[('little', 'boy', 'small', 'n')]
[('little', 'boy', '', 'n')]
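With re.findall an optional group that did not take part in the match shows up as an empty string. With re.search the same group is None, which can be easier to test for. A small sketch using a named group; the name size is an arbitrary choice for illustration:

```python
import re

pattern = re.compile(
    r"a (little|big) (boy|girl) went to the (?:(?P<size>small) )?garden"
)

with_word = pattern.search("a little boy went to the small garden and ate an apple")
without_word = pattern.search("a little boy went to the garden and ate an apple")

# Both lines match; only the optional group differs.
print(with_word.group("size"))
print(without_word.group("size"))
```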

Iterate and replace words in lines of a tuple python

I want to iterate through this tuple and for each line, iterate through the words to find and replace some words (internet addresses, precisely) using regex while leaving them as lines.
aList = [
"being broken changes people, \nand rn im missing the old me",
"#SaifAlmazroui #troyboy621 #petr_hruby you're all missing the point",
"#News #Detroit Detroit water customer receives shutoff threat over missing 10 cents: - Theresa Braxton is a l... T.CO/CHPBRVH9WKk",
"#_EdenRodwell \ud83d\ude29\ud83d\ude29ahh I love you!! Missing u, McDonald's car park goss soon please \u2764\ufe0f\u2764\ufe0fxxxxx",
"This was my ring tone, before I decided change was good and missing a call was insignificant T.CO?BUXLVZFDWQ",
"want to go on holiday again, missing the sun\ud83d\ude29\u2600\ufe0f"
]
This code below almost does that, but it breaks the list into words separated by lines:
i = 0
while i < len(aList):
    for line in aList[i].split():
        line = re.sub(r"^[http](.*)\/(.*)$", "", line)
        print(line)
    i += 1
I'd like the result to be the same lines, but with the internet addresses removed from each one:
[
"being broken changes people, \nand rn im missing the old me",
"#SaifAlmazroui #troyboy621 #petr_hruby you're all missing the point",
"#News #Detroit Detroit water customer receives shutoff threat over missing 10 cents: - Theresa Braxton is a ",
"#_EdenRodwell \ud83d\ude29\ud83d\ude29ahh I love you!! Missing u, McDonald's car park goss soon please \u2764\ufe0f\u2764\ufe0fxxxxx",
"This was my ring tone, before I decided change was good and missing a call was insignificant",
"want to go on holiday again, missing the sun\ud83d\ude29\u2600\ufe0f"
]
Thanks
From this:
re.sub(r"^[http](.*)\/(.*)$", "", line)
it looks to me as if you expect that all your URLs will be at the end of the line. In that case, try:
[re.sub('http://.*', '', s) for s in aList]
Here, http:// matches the literal text http://, and .* matches everything that follows it up to the end of the line.
Example
Here is your list with some URLs added:
aList = [
"being broken changes people, \nand rn im missing the old me",
"#SaifAlmazroui #troyboy621 #petr_hruby you're all missing the point",
"#News #Detroit Detroit water customer receives shutoff threat over missing 10 cents: - Theresa Braxton is a http://example.com/CHPBRVH9WKk",
"#_EdenRodwell ahh I love you!! Missing u, McDonald's car park goss soon please xxxxx",
"This was my ring tone, before I decided change was good and missing a call was insignificant http://example.com?BUXLVZFDWQ",
"want to go on holiday again, missing the sun"
]
Here is the result:
>>> [re.sub('http://.*', '', s) for s in aList]
['being broken changes people, \nand rn im missing the old me',
"#SaifAlmazroui #troyboy621 #petr_hruby you're all missing the point",
'#News #Detroit Detroit water customer receives shutoff threat over missing 10 cents: - Theresa Braxton is a ',
"#_EdenRodwell ahh I love you!! Missing u, McDonald's car park goss soon please xxxxx",
'This was my ring tone, before I decided change was good and missing a call was insignificant ',
'want to go on holiday again, missing the sun']
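If a URL can also appear in the middle of a line, or can start with https://, a narrower pattern that stops at the next whitespace may be a better fit. The following is a sketch under those assumptions, not a general URL matcher; the function name strip_urls and the sample strings are invented for illustration:

```python
import re

# Optional leading whitespace, then http:// or https://, then everything
# up to the next whitespace character.
url_pattern = re.compile(r"\s*https?://\S+")

def strip_urls(lines):
    return [url_pattern.sub("", line) for line in lines]

sample = [
    "Theresa Braxton is a http://example.com/CHPBRVH9WKk",
    "see https://example.com/x now",
    "want to go on holiday again, missing the sun",
]
cleaned = strip_urls(sample)
print(cleaned)
```

Including the leading whitespace in the match avoids leaving a double space or a trailing space where the URL used to be.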
Your question is a little unclear, but I think I get what you're going for:
newlist = [re.sub(r"{regex}", "", line) for line in aList]
This list comprehension iterates through the list of strings and replaces every substring that matches your regex pattern (written here as the placeholder {regex}) with an empty string.
Side note:
Looking closer at your regex, it doesn't seem to be doing what you think it's doing. For example, [http] is a character class that matches a single character (h, t or p), not the text http.
I would look at this Stack Overflow post about matching URLs with a regex:
Regex to find urls in string in Python

Regex uppercase words with condition

I'm new to regex and I can't figure it out how to do this:
Hello this is JURASSIC WORLD shut up Ok
[REVIEW] The movie BATMAN is awesome lol
What I need is the title of the movie. It will be only one per sentence. I have to ignore the words between [] as it will not be the title of the movie.
I thought of this:
^\w([A-Z]{2,})+
Any help would be welcome.
Thanks.
You can use negative lookarounds to ensure that the title is not within []
\b(?<!\[)[A-Z ]{2,}(?!\])\b
\b Matches word boundary.
(?<!\[) Negative look behind. Checks if the matched string is not preceded by [
[A-Z ]{2,} Matches 2 or more characters, each an uppercase letter or a space.
(?!\]) Negative look ahead. Ensures that the string is not followed by ]
Example
>>> string = """Hello this is JURASSIC WORLD shut up Ok
... [REVIEW] The movie BATMAN is awesome lol"""
>>> re.findall(r'\b(?<!\[)[A-Z ]{2,}(?!\])\b', string)
[' JURASSIC WORLD ', ' BATMAN ']
>>>
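Because the character class also allows spaces, each match above carries a leading and a trailing space. A short sketch of one way to clean that up, assuming it is fine to trim whitespace from every match:

```python
import re

text = """Hello this is JURASSIC WORLD shut up Ok
[REVIEW] The movie BATMAN is awesome lol"""

# Same pattern as in the answer; strip() removes the surrounding spaces.
titles = [m.strip() for m in re.findall(r'\b(?<!\[)[A-Z ]{2,}(?!\])\b', text)]
print(titles)
```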
