python regex match a possible word - python

I want a regex that matches a word that may or may not be present. I read here that I should try something like this:
import re
line = "a little boy went to the small garden and ate an apple"
res = re.findall("a (little|big) (boy|girl) went to the (?=.*\bsmall\b) garden and ate a(n?)",line)
print res
but the output of this is
[]
which is also the output if I set line to be
a little boy went to the garden and ate an apple
How do I allow for a word that may or may not exist in my text, and capture it if it exists?

First, you need to match not only the word "small" but also the space after it (or before it), so the optional part becomes (small )?.
However, you only want to capture the word itself. To keep the space out of the capture, make the outer group non-capturing and capture just the word: (?:(small) )?
Full example:
import re
lines = [
'a little boy went to the small garden and ate an apple',
'a little boy went to the garden and ate an apple'
]
for line in lines:
    res = re.findall(r'a (little|big) (boy|girl) went to the (?:(small) )?garden and ate a(n?)', line)
    print(res)
Output:
[('little', 'boy', 'small', 'n')]
[('little', 'boy', '', 'n')]
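If you want the optional word on its own rather than inside a findall tuple, a minimal sketch with re.search and named groups could look like the following. The group names (size, who, adj, n) and the helper find_adjective are made up for illustration; the pattern is otherwise the one from the answer above. Note that a match object reports an optional group that did not participate as None, whereas findall reports it as an empty string.

```python
import re

# Same pattern as in the answer, with hypothetical group names added for readability.
PATTERN = re.compile(
    r'a (?P<size>little|big) (?P<who>boy|girl) went to the '
    r'(?:(?P<adj>small) )?garden and ate a(?P<n>n?)'
)

def find_adjective(line):
    """Return the optional word, or None if it is absent or the line does not match."""
    m = PATTERN.search(line)
    return m.group('adj') if m else None

print(find_adjective('a little boy went to the small garden and ate an apple'))  # small
print(find_adjective('a little boy went to the garden and ate an apple'))        # None
```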

Related

How to add a missing closing parenthesis to a string in Python?

I have multiple strings to postprocess, where a lot of the acronyms have a missing closing bracket. Assume the string text below, but also assume that this type of missing bracket happens often.
My code below only adds the closing bracket to each extracted acronym on its own; it does not fix the full string/sentence. Any tips on how to do this efficiently, and preferably without needing to iterate?
import re
#original string
text = "The dog walked (ABC in the park"
#Desired output:
desired_output = "The dog walked (ABC) in the park"
#My code:
acronyms = re.findall(r'\([A-Z]*\)?', text)
for acronym in acronyms:
    if ')' not in acronym: #find those without a closing bracket ')'.
        print(acronym + ')') #add the closing bracket ')'.
#current output:
>>'(ABC)'
You may use
text = re.sub(r'(\([A-Z]+(?!\))\b)', r"\1)", text)
With this approach you no longer need to check whether the acronym already has a closing ), see a demo on regex101.com.
In full:
import re
#original string
text = "The dog walked (ABC in the park"
text = re.sub(r'(\([A-Z]+(?!\))\b)', r"\1)", text)
print(text)
This yields
The dog walked (ABC) in the park
See a working demo on ideone.com.
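To see how the same substitution behaves when a text mixes closed and unclosed acronyms, here is a small sketch. The helper name close_acronyms and the sample sentence are made up for illustration; the pattern is the one from the answer above.

```python
import re

# Pattern from the answer: "(" plus capitals, not followed by ")", ending at a word boundary.
UNCLOSED = re.compile(r'(\([A-Z]+(?!\))\b)')

def close_acronyms(text):
    """Append ')' to every '(ABC' style acronym that lacks it."""
    return UNCLOSED.sub(r'\1)', text)

sample = "The (ABC walked to (DEF) and met (GHI"
print(close_acronyms(sample))
# The (ABC) walked to (DEF) and met (GHI)
```

The already closed (DEF) is left alone because the negative lookahead and the word boundary cannot both be satisfied inside it.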
For the example you have provided, I don't see the need for a regex.
You can just use some string methods:
text = "The dog walked (ABC in the park"
withoutClosing = [word for word in text.split() if word.startswith('(') and not word.endswith(')') ]
withoutClosing
Out[45]: ['(ABC']
Now that you have the words without a closing parenthesis, you can just replace them:
for eachWord in withoutClosing:
    text = text.replace(eachWord, eachWord+')')
text
Out[46]: 'The dog walked (ABC) in the park'
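One thing to watch with str.replace: it replaces every occurrence of the substring, so if the same acronym also appears correctly closed elsewhere in the text, that copy would end up with a doubled bracket. A token-by-token variant avoids this. This is only a sketch: close_parens is a hypothetical helper, and it assumes single-word acronyms separated by single spaces with no trailing punctuation.

```python
def close_parens(text):
    """Close '(WORD' tokens one at a time instead of replacing substrings globally."""
    fixed = []
    for word in text.split(' '):
        if word.startswith('(') and not word.endswith(')'):
            word += ')'
        fixed.append(word)
    return ' '.join(fixed)

print(close_parens("The dog walked (ABC in the park"))
# The dog walked (ABC) in the park
print(close_parens("(ABC) is short for (ABC here"))
# (ABC) is short for (ABC) here
```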

Regex noob question: getting several words/sentences from one line, max separation being 1 whitespace?

I'm not terribly familiar with Python regex, or regex in general, but I'm hoping to demystify it all a bit more with time.
My problem is this: given a string like ' Apple Banana Cucumber Alphabetical Fruit Whoops' (the items are separated by runs of several spaces, and only 'Alphabetical Fruit' contains a single space), I'm trying to use Python's re.findall function to get a list that looks like this: my_list = [' Apple', ' Banana', ' Cucumber', ' Alphabetical Fruit', ' Whoops']. In other words, I'm trying to find a regex that looks for a bunch of whitespace followed by some non-whitespace, and then checks whether there is a single space with some more non-whitespace characters after that.
This is the function I've written that gets me cloooose but not quite:
re.findall("\s+\S+\s{1}\S*", my_list)
Which results in:
[' Apple ', ' Banana ', ' Cucumber ', ' Alphabetical Fruit']
I think this result makes sense. It first finds the whitespace, then some non-whitespace, but then it looks for at least one whitespace (which leaves out 'Whoops'), and then looks for any number of other non-whitespace characters (which is why there's no space after 'Alphabetical Fruit'). I just don't know what character combination would give me the intended result.
Any help would be hugely appreciated!
-WW
You can do:
\s+\w+(?:\s\w+)?
\s+\w+ matches one or more whitespace characters, followed by one or more of [A-Za-z0-9_]
(?:\s\w+)? is an optional (?, zero or one) non-capturing group ((?:)) that matches a whitespace character (\s) followed by one or more of [A-Za-z0-9_] (\w+). Essentially this is to match Fruit in Alphabetical Fruit.
Example:
In [701]: text = ' Apple Banana Cucumber Alphabetical Fruit Whoops'
In [702]: re.findall(r'\s+\w+(?:\s\w+)?', text)
Out[702]:
[' Apple',
' Banana',
' Cucumber',
' Alphabetical Fruit',
' Whoops']
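Your pattern works already, just make the second part (the 'compound word' part) optional. Use a non-capturing group so that re.findall still returns the whole match rather than only the group:
\s+\S+(?:\s\S+)?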
https://regex101.com/r/Ua8353/3/
(fixed \s{1} per @heemayl)
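To compare the \w-based and \S-based patterns from the two answers, here is a small sketch. The input string is an assumption: it uses three spaces between items and a single space inside "Alphabetical Fruit", which is one way to satisfy the description in the question. Both patterns use a non-capturing group, since re.findall returns only the group contents once a pattern contains a capturing group.

```python
import re

# Assumed input: three spaces between items, one space inside "Alphabetical Fruit".
text = '   Apple   Banana   Cucumber   Alphabetical Fruit   Whoops'

word_chars = re.findall(r'\s+\w+(?:\s\w+)?', text)
non_space = re.findall(r'\s+\S+(?:\s\S+)?', text)
print(word_chars)
print(word_chars == non_space)  # True for this input

# The two differ once an item contains a non-word character such as an apostrophe.
apostrophe = "  don't   stop"
print(re.findall(r'\s+\w+(?:\s\w+)?', apostrophe))  # ['  don', '   stop']
print(re.findall(r'\s+\S+(?:\s\S+)?', apostrophe))  # ["  don't", '   stop']
```

Which one fits depends on whether items may contain punctuation; \S is the closer match to "some non-whitespace" as worded in the question.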

Split by regex of new line and capital letter

I've been struggling to split my string by a regex expression in Python.
I have a text file which I load that is in the format of:
"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch
at Kate's house. Kyle went home at 9. \nSome other sentence
here\n\u2022Here's a bulleted line"
I'd like to get the following output:
['Peter went to the gym; he worked out for two hours','Kyle ate lunch
at Kate's house. He went home at 9.', 'Some other sentence here',
'\u2022Here's a bulleted line']
I'm looking to split my string by a new line and a capital letter or a bullet point in Python.
I've tried tackling the first half of the problem, splitting my string by just a new line and capital letter.
Here's what I have so far:
print re.findall(r'\n[A-Z][a-z]+',str,re.M)
This just gives me:
[u'\nKyle', u'\nSome']
which is just the first word. I've tried variations of that regex expression but I don't know how to get the rest of the line.
I assume that to also split by the bullet point, I would just include an OR regex expression that is in the same format as the regex of splitting by a capital letter. Is this the best way?
I hope this makes sense and I'm sorry if my question is in any way unclear. :)
You can use this split function:
>>> str = u"Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch at Kate's house. Kyle went home at 9. \nSome other sentence here\n\u2022Here's a bulleted line"
>>> print re.split(u'\n(?=\u2022|[A-Z])', str)
[u'Peter went to the gym; \nhe worked out for two hours ',
u"Kyle ate lunch at Kate's house. Kyle went home at 9. ",
u'Some other sentence here',
u"\u2022Here's a bulleted line"]
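The split above still leaves the inner "\nhe worked out" line break and the trailing spaces in place, while the desired output in the question has them cleaned up. A minimal Python 3 sketch that adds that step could look like this. The helper name split_records and the whitespace-collapsing step are my additions, not part of the answer.

```python
import re

BULLET = '\u2022'

raw = (
    "Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch "
    "at Kate's house. Kyle went home at 9. \nSome other sentence "
    "here\n" + BULLET + "Here's a bulleted line"
)

def split_records(text):
    # Split on a newline only when a capital letter or a bullet follows it.
    parts = re.split('\n(?=[A-Z' + BULLET + '])', text)
    # Collapse remaining newlines and repeated spaces inside each record.
    return [' '.join(part.split()) for part in parts if part.strip()]

for record in split_records(raw):
    print(record)
```

Because the lookahead consumes nothing, the capital letter or bullet stays at the start of the next record.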
Code Demo
You can split at a \n followed by a capital letter or the bullet character:
import re
s = """
Peter went to the gym; \nhe worked out for two hours \nKyle ate lunch
at Kate's house. Kyle went home at 9. \nSome other sentence
here\n\u2022Here's a bulleted line
"""
new_list = list(filter(None, re.split('\n(?=•)|\n(?=[A-Z])', s)))
Output:
['Peter went to the gym; \nhe worked out for two hours ', "Kyle ate lunch \nat Kate's house. Kyle went home at 9. ", 'Some other sentence \nhere', "•Here's a bulleted line\n"]
Or, without using the symbol for the bullet character:
new_list = list(filter(None, re.split('\n(?=\u2022)|\n(?=[A-Z])', s)))

Find lots of string in text - Python

I'm searching for the best algorithm to solve this problem: given a list (or a dict, or a set) of short sentences, find all occurrences of these sentences in a longer text. There are about 600k sentences in the list (or dict, or set), each about 3 words long on average. The text is, on average, 25 words long. I've already normalized the text (removed punctuation, lowercased everything, and so on).
Here is what I have tried out (Python):
to_find_sentences = [
'bla bla',
'have a tea',
'hy i m luca',
'i love android',
'i love ios',
.....
]
text = 'i love android and i think i will have a tea with john'
def find_sentence(to_find_sentences, text):
    text = text.split()
    res = []
    w = len(text)
    for i in range(w):
        for j in range(i+1, w+1):
            # every run of consecutive words is a candidate sentence
            tmp = ' '.join(text[i:j])
            if tmp in to_find_sentences:
                res.append(tmp)
    return res
print(find_sentence(to_find_sentences, text))
Out:
['i love android', 'have a tea']
In my actual code, to_find_sentences is a set, which speeds up the in operation.
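As a point of comparison, the brute-force idea can be tightened a little by never building candidates longer than the longest sentence in the set. This is only a sketch under that assumption; the name find_sentences_bounded is made up for illustration.

```python
def find_sentences_bounded(sentences, text):
    """Check every word n-gram of the text, up to the longest sentence length."""
    wanted = set(sentences)  # set membership tests are O(1) on average
    max_words = max((len(s.split()) for s in wanted), default=0)
    words = text.split()
    found = []
    for i in range(len(words)):
        # Candidates longer than max_words cannot be in the set.
        for j in range(i + 1, min(i + max_words, len(words)) + 1):
            candidate = ' '.join(words[i:j])
            if candidate in wanted:
                found.append(candidate)
    return found

sentences = ['bla bla', 'have a tea', 'hy i m luca', 'i love android', 'i love ios']
text = 'i love android and i think i will have a tea with john'
print(find_sentences_bounded(sentences, text))
# ['i love android', 'have a tea']
```

With a 25-word text and sentences of a few words each, this keeps the number of set lookups per text roughly proportional to the text length.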
A fast solution would be to build a Trie out of your sentences and convert this trie to a regex. For your example, the pattern would look like this:
(?:bla\ bla|h(?:ave\ a\ tea|y\ i\ m\ luca)|i\ love\ (?:android|ios))
It might be a good idea to add '\b' as word boundaries, to avoid matching "have a team".
You'll need a small Trie script. It's not an official package yet, but you can simply download it here as trie.py in your current directory.
You can then use this code to generate the trie/regex:
import re
from trie import Trie
to_find_sentences = [
'bla bla',
'have a tea',
'hy i m luca',
'i love android',
'i love ios',
]
trie = Trie()
for sentence in to_find_sentences:
    trie.add(sentence)
print(trie.pattern())
# (?:bla\ bla|h(?:ave\ a\ tea|y\ i\ m\ luca)|i\ love\ (?:android|ios))
pattern = re.compile(r"\b" + trie.pattern() + r"\b", re.IGNORECASE)
text = 'i love android and i think i will have a tea with john'
print(re.findall(pattern, text))
# ['i love android', 'have a tea']
You invest some time to create the Trie and the regex, but the processing should be extremely fast.
Here's a related answer (Speed up millions of regex replacements in Python 3) if you want more information.
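If you would rather not download anything, the idea can be sketched in a few lines. To be clear, MiniTrie below is a hypothetical stand-in written for illustration, not the trie.py script mentioned above, and it may produce a differently shaped (but equivalent) pattern.

```python
import re

class MiniTrie:
    """Hypothetical stand-in for trie.py: builds one regex from many strings."""

    def __init__(self):
        self.root = {}

    def add(self, sentence):
        node = self.root
        for char in sentence:
            node = node.setdefault(char, {})
        node[''] = {}  # empty key marks the end of a sentence

    def _to_regex(self, node):
        # One alternative per character that can follow this node.
        alternatives = [
            re.escape(char) + self._to_regex(child)
            for char, child in sorted(node.items())
            if char != ''
        ]
        if not alternatives:
            return ''
        if len(alternatives) == 1 and '' not in node:
            return alternatives[0]
        group = '(?:' + '|'.join(alternatives) + ')'
        # If a sentence may also end here, the continuation is optional.
        return group + '?' if '' in node else group

    def pattern(self):
        return self._to_regex(self.root)

trie = MiniTrie()
for sentence in ['bla bla', 'have a tea', 'hy i m luca', 'i love android', 'i love ios']:
    trie.add(sentence)

pattern = re.compile(r'\b' + trie.pattern() + r'\b')
print(pattern.findall('i love android and i think i will have a tea with john'))
# ['i love android', 'have a tea']
```

Sharing common prefixes means the regex engine does not retry each of the 600k sentences separately at every position, which is where the speed-up comes from.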
Note that it wouldn't find overlapping sentences:
to_find_sentences = [
'i love android',
'android Marshmallow'
]
# ...
print(re.findall(pattern, "I love android Marshmallow"))
# ['I love android']
You'd have to modify the regex with positive lookaheads to find overlapping sentences.
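A minimal sketch of the lookahead idea, using a plain alternation instead of the trie pattern to keep it self-contained (the variable names are made up for illustration):

```python
import re

sentences = ['i love android', 'android Marshmallow']
alternation = '|'.join(re.escape(s) for s in sentences)

# The capturing group sits inside a zero-width lookahead, so the scan can
# start another match inside text that a previous match already covered.
overlapping = re.compile(r'(?=\b(' + alternation + r')\b)', re.IGNORECASE)

print(overlapping.findall('I love android Marshmallow'))
# ['I love android', 'android Marshmallow']
```

One limitation of this sketch: at any given start position only the first alternative that matches is reported, so two sentences that begin at the same word would need extra handling.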

Regex uppercase words with condition

I'm new to regex and I can't figure out how to do this:
Hello this is JURASSIC WORLD shut up Ok
[REVIEW] The movie BATMAN is awesome lol
What I need is the title of the movie. There will be only one per sentence. I have to ignore the words between [] as they will not be the title of the movie.
I thought of this:
^\w([A-Z]{2,})+
Any help would be welcome.
Thanks.
You can use negative lookarounds to ensure that the title is not within []
\b(?<!\[)[A-Z ]{2,}(?!\])\b
\b Matches a word boundary.
(?<!\[) Negative lookbehind. Checks that the matched string is not preceded by [
[A-Z ]{2,} Matches 2 or more characters that are uppercase letters or spaces, which is why the matches include the surrounding spaces.
(?!\]) Negative lookahead. Ensures that the string is not followed by ]
Example
>>> string = """Hello this is JURASSIC WORLD shut up Ok
... [REVIEW] The movie BATMAN is awesome lol"""
>>> re.findall(r'\b(?<!\[)[A-Z ]{2,}(?!\])\b', string)
[' JURASSIC WORLD ', ' BATMAN ']
>>>
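Since there is only one title per sentence, re.search plus strip() gives the title directly without the surrounding spaces. A small sketch, where extract_title is a hypothetical helper built on the pattern above:

```python
import re

TITLE_RE = re.compile(r'\b(?<!\[)[A-Z ]{2,}(?!\])\b')

def extract_title(sentence):
    """Return the first run of uppercase words outside [], or None if there is none."""
    m = TITLE_RE.search(sentence)
    return m.group().strip() if m else None

print(extract_title("Hello this is JURASSIC WORLD shut up Ok"))   # JURASSIC WORLD
print(extract_title("[REVIEW] The movie BATMAN is awesome lol"))  # BATMAN
```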
