Getting rid of few entities using regex python - python

I am new to regex. Given the phrase below, I want to get rid of the standalone I's and of the extra empty field that appears because I combine two regex operations.
text= "I have a problem in Regex, How do I get rid of the Capital I's provided I want to retain words occurring together as logical entity with a Capital letter in the beginning of each word like International Business Machine "
For example
I would like to retain "International Business Machine" as it is, but reduce "Capital I's" to just "Capital".
I used the below Regular Expression:
re.findall('([A-Z][\w\']*(?:\s+[A-Z][\w|\']*)+)|([A-Z][\w]*)', text)
The output I received is
[('', 'I'),
('', 'Regex'),
('', 'How'),
('', 'I'),
("Capital I's", ''),
('', 'I'),
('', 'Capital'),
('International Business Machine', '')]
However, I would like my output to be:
[('Regex'),
('How'),
("Capital"),
('Capital'),
('International Business Machine')]
How do I get rid of the "I" and the extra field that appears because of using two regex operations?
Thanks

Match a word that starts with a capital letter followed by one or more word characters, then add a pattern for any following words of the same form (also starting with a capital letter) and let that pattern repeat zero or more times. It will then match strings like Foo or Foo Bar Buzz. Because \w+ requires at least one more character after the capital letter, a lone I is never matched.
>>> text= "I have a problem in Regex, How do I get rid of the Capital I's provided I want to retain words occurring together as logical entity with a Capital letter in the beginning of each word like International Business Machine "
>>> import re
>>> re.findall(r'\b[A-Z]\w+(?:\s+[A-Z]\w+)*', text)
['Regex', 'How', 'Capital', 'Capital', 'International Business Machine']
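As a side note on the empty fields in the question's output, here is a minimal sketch of how re.findall behaves with and without capturing groups. The 'ab' sample string is only an illustration and is not taken from the question.

```python
import re

# Two capturing groups joined by "|": findall returns one tuple per match,
# with an empty string for the group that did not take part in the match.
two_groups = re.findall(r'(a)|(b)', 'ab')
print(two_groups)  # [('a', ''), ('', 'b')]

# Only non-capturing groups (?:...): findall returns the whole match as a string.
no_groups = re.findall(r'(?:a)|(?:b)', 'ab')
print(no_groups)  # ['a', 'b']
```

This is why a single pattern built from non-capturing groups gives a flat list of strings.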

If you also want to match apostrophes (as in your example), you can try:
(?:[A-Z](?:[\w]|(?<=\w\w)\')+\s?)+
It matches ' only if it is preceded by at least two word characters. Not a very fancy solution, but it works. Then:
import re
text = 'I have a problem in Regex, How do I get rid of the Capital I\'s provided I want to retain words occurring together as logical entity with a Capital letter in the beginning of each word like International Business Machine'
found = re.findall(r'(?:[A-Z](?:[\w]|(?<=\w\w)\')+\s?)+', text)
print(found)
gives this result (matches keep the trailing space consumed by \s?):
['Regex', 'How ', 'Capital ', 'Capital ', 'International Business Machine']
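A small sketch of how the trailing whitespace could be stripped and how the apostrophe rule behaves. The sample sentence is hypothetical and was chosen only to show one apostrophe that is kept and one that is rejected.

```python
import re

# Same pattern as above, written as a raw string with double quotes.
pattern = re.compile(r"(?:[A-Z](?:\w|(?<=\w\w)')+\s?)+")

# Hypothetical sample sentence, not taken from the question.
sample = "We visited Macy's Store and I's"

raw = pattern.findall(sample)
# Remove the whitespace that \s? leaves at the end of a match.
cleaned = [m.strip() for m in raw]
print(cleaned)  # ['We', "Macy's Store"]
```

The apostrophe in Macy's is accepted because two word characters precede it, while I's is skipped because only one does.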

Related

General regex about english words, numbers, length and specific symbols

I am working on an NLP task and I want to do some general cleaning; the specific purpose is not important here.
I want a function that does the following:
Remove non-English words
Remove words that are entirely in capitals
Remove words that have '-' at the beginning or the end
Remove words that are shorter than 2 characters
Remove words that consist only of numbers
For example if I have the following string
'George -wants to play 123_134 foot-ball in _123pantis FOOTBALL ελλαδα 123'
the output should be
'George play 123_134 _123pantis'
The function that I have created already is the following:
def clean(text):
    # remove words that aren't English words (isn't working)
    #text = re.sub(r'^[a-zA-Z]+', '', text)
    # remove words that are in capital
    text = re.sub(r'(\w*[A-Z]+\w*)', '', text)
    # remove words that start or have - in the middle (isn't working)
    text = re.sub(r'(\s)-\w+', '', text)
    # remove words that have length less than 2 characters (is working)
    text = re.sub(r'\b\w{1,2}\b', '', text)
    # remove words with only numbers (isn't working)
    text = re.sub(r'[0-9]+', '', text)
    return text
The output is
- play _ foot-ball _ _pantis ελλαδα
which is not what I need. Thank you very much for your time and help!
You can do this in a single re.sub call.
Search using this regex:
(?:\b(?:\w+(?=-)|\w{2}|\d+|[A-Z]+|\w*[^\x01-\x7F]\w*)\b|-\w+)\s*
and replace with empty string.
Code:
import re
s = 'George -wants to play 123_134 foot-ball in _123pantis FOOTBALL ελλαδα 123'
r = re.sub(r'(?:\b(?:\w+(?=-)|\w{2}|\d+|[A-Z]+|\w*[^\x01-\x7F]\w*)\b|-\w+)\s*', '', s)
print(r)
# George play 123_134 _123pantis
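For comparison, here is a hedged sketch of the same cleaning done token by token with plain string methods instead of one regex. The helper names keep and clean_tokens are hypothetical, "non-English" is approximated as "contains non-ASCII characters", and the rules are inferred from the expected output in the question (for example, words of two characters or fewer and words with an inner hyphen are dropped). str.isascii needs Python 3.7 or later.

```python
def keep(token):
    """Return True if a whitespace-separated token survives the cleaning rules."""
    if not token.isascii():                  # non-ASCII, treated here as "non-English"
        return False
    if '-' in token:                         # leading, trailing or inner hyphen
        return False
    if len(token) <= 2:                      # very short words such as "to" and "in"
        return False
    if token.isdigit():                      # digits only
        return False
    if token.isalpha() and token.isupper():  # letters only, all in capitals
        return False
    return True


def clean_tokens(text):
    # Split on whitespace, filter, and join with single spaces.
    return ' '.join(t for t in text.split() if keep(t))


s = 'George -wants to play 123_134 foot-ball in _123pantis FOOTBALL ελλαδα 123'
result = clean_tokens(s)
print(result)  # George play 123_134 _123pantis
```

Each rule sits on its own line, which may make it easier to adjust a single condition than editing one long alternation.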

Regex noob question: getting several words/sentences from one line, max separation being 1 whitespace?

I'm not terribly familiar with Python regex, or regex in general, but I'm hoping to demystify it all a bit more with time.
My problem is this: given a string like ' Apple Banana Cucumber Alphabetical Fruit Whoops', I'm trying to use Python's re.findall function to get a list that looks like this: my_list = [' Apple', ' Banana', ' Cucumber', ' Alphabetical Fruit', ' Whoops']. In other words, I'm trying to find a regular expression that can [look for a bunch of whitespace followed by some non-whitespace], and then check if there is a single space with some more non-whitespace characters after that.
This is the function I've written that gets me cloooose but not quite:
re.findall("\s+\S+\s{1}\S*", my_list)
Which results in:
[' Apple ', ' Banana ', ' Cucumber ', ' Alphabetical Fruit']
I think this result makes sense. It first finds the whitespace, then some non-whitespace, but then it looks for at least one whitespace (which leaves out 'Whoops'), and then looks for any number of other non-whitespace characters (which is why there's no space after 'Alphabetical Fruit'). I just don't know what character combination would give me the intended result.
Any help would be hugely appreciated!
-WW
You can do:
\s+\w+(?:\s\w+)?
\s+\w+ matches one or more whitespace characters, followed by one or more word characters (\w, i.e. [A-Za-z0-9_] for ASCII text)
(?:\s\w+)? is an optional (?, zero or one) non-capturing group ((?:)) that matches a single whitespace character (\s) followed by one or more word characters (\w+). Essentially, this is what matches Fruit in Alphabetical Fruit.
Example:
In [701]: text = ' Apple Banana Cucumber Alphabetical Fruit Whoops'
In [702]: re.findall(r'\s+\w+(?:\s\w+)?', text)
Out[702]:
[' Apple',
' Banana',
' Cucumber',
' Alphabetical Fruit',
' Whoops']
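A runnable sketch comparing the pattern from the question with the fix. It assumes the items in the input are separated by runs of three spaces; the exact spacing is a guess made for illustration.

```python
import re

# Assumed input: items separated by runs of three spaces (spacing is a guess).
text = '   Apple   Banana   Cucumber   Alphabetical Fruit   Whoops'

# Pattern from the question: it insists on a whitespace character after the
# first word, so the last item (followed by nothing) is missed.
original = re.findall(r'\s+\S+\s{1}\S*', text)

# Fixed pattern: the "single space plus another word" part is optional.
fixed = re.findall(r'\s+\w+(?:\s\w+)?', text)

print(original)
print(fixed)
```

With this input, the fixed pattern returns all five items, including the two-word entry and the final Whoops.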
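Your pattern already works; just make the second part (the 'compound word' part) optional:
\s+\S+(\s\S+)?
If you use it with re.findall, write the group as non-capturing, (?:\s\S+)?, because findall returns only the group's contents when the pattern contains a capturing group.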
https://regex101.com/r/Ua8353/3/
(fixed \s{1} per @heemayl)
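One thing to watch when a pattern like this is passed to re.findall: a capturing group changes what findall returns. A minimal sketch, assuming a shortened input with three-space gaps (the spacing is an assumption):

```python
import re

# Assumed, shortened input with runs of three spaces between items.
text = '   Apple   Alphabetical Fruit   Whoops'

# With a capturing group, findall returns only what the group matched
# (an empty string when the optional group did not take part).
captured = re.findall(r'\s+\S+(\s\S+)?', text)
print(captured)  # ['', ' Fruit', '']

# With a non-capturing group, findall returns the whole match.
whole = re.findall(r'\s+\S+(?:\s\S+)?', text)
print(whole)  # ['   Apple', '   Alphabetical Fruit', '   Whoops']
```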

getting words between m and n characters

I am trying to get all names that start with a capital letter and end with a full stop on the same line, where the number of characters is between 3 and 5.
My text is as follows:
King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Nor would we deigne him buriall of his men,
Till he disbursed, at Saint Colmes ynch,
Ten thousand Dollars, to our generall vse
King. No more that Thane of Cawdor shall deceiue
Our Bosome interest: Goe pronounce his present death,
And with his former Title greet Macbeth
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne.
I am testing it out on this link. I am trying to get all words between 3 and 5 but haven't succeeded.
Does this produce your desired output?
import re
re.findall(r'[A-Z].{2,4}\.', text)
When text contains the text in your question it will produce this output:
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
The regex pattern matches a capital letter followed by 2 to 4 arbitrary characters (anything except a newline) and then a literal dot/period, so the name before the dot is 3 to 5 characters long. You can tighten that up if required, e.g. [A-Z][a-z]{2,4}\. would match an uppercase character followed by 2 to 4 lowercase characters followed by a literal dot/period.
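A sketch of one way to tighten the pattern further, assuming the names always sit at the start of a line (possibly after spaces or tabs). It uses re.MULTILINE so that ^ matches at every line start, and a capturing group so that findall returns the name without the dot. The text below is a shortened excerpt of the passage in the question.

```python
import re

# Shortened excerpt of the passage from the question.
text = """King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
King. No more that Thane of Cawdor shall deceiue
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne."""

# ^ with re.MULTILINE anchors at each line start; the group captures the
# 3 to 5 letter name, and the trailing \. requires the full stop.
names = re.findall(r'^[ \t]*([A-Z][a-z]{2,4})\.', text, flags=re.MULTILINE)
print(names)  # ['King', 'Rosse', 'King', 'Rosse', 'King']

# Sorted list of distinct names.
unique = sorted(set(names))
print(unique)  # ['King', 'Rosse']
```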
If you don't want duplicates you can use a set to get rid of them:
>>> set(re.findall(r'[A-Z].{2,4}\.', text))
{'Rosse.', 'King.'}
You may have your own reasons for wanting to use regexs here, but Python provides a rich set of string methods and (IMO) it's easier to understand the code using these:
matched_words = []
with open('text.txt') as f:
    for line in f:
        for word in line.split():
            # capitalized word ending in '.', with 3 to 5 characters before the dot
            if word[0].isupper() and word[-1] == '.' and 3 <= len(word) - 1 <= 5:
                matched_words.append(word)
print(matched_words)

Regex uppercase words with condition

I'm new to regex and I can't figure it out how to do this:
Hello this is JURASSIC WORLD shut up Ok
[REVIEW] The movie BATMAN is awesome lol
What I need is the title of the movie. It will be only one per sentence. I have to ignore the words between [] as it will not be the title of the movie.
I thought of this:
^\w([A-Z]{2,})+
Any help would be welcome.
Thanks.
You can use negative lookarounds to ensure that the title is not within []:
\b(?<!\[)[A-Z ]{2,}(?!\])\b
\b Matches word boundary.
(?<!\[) Negative lookbehind. Ensures that the match is not preceded by [
[A-Z ]{2,} Matches 2 or more uppercase letters or spaces, which is why the results below keep their surrounding spaces.
(?!\]) Negative lookahead. Ensures that the match is not followed by ]
Example
>>> string = """Hello this is JURASSIC WORLD shut up Ok
... [REVIEW] The movie BATMAN is awesome lol"""
>>> re.findall(r'\b(?<!\[)[A-Z ]{2,}(?!\])\b', string)
[' JURASSIC WORLD ', ' BATMAN ']
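Since the matches come back with surrounding spaces, here is a sketch of two ways to get clean titles: stripping the results, or a tighter variant of the pattern. The tighter pattern is an assumption on my part, not part of the answer above, and it expects every word of a title to have at least two uppercase letters.

```python
import re

string = """Hello this is JURASSIC WORLD shut up Ok
[REVIEW] The movie BATMAN is awesome lol"""

# Option 1: keep the pattern from the answer and strip each match.
stripped = [m.strip() for m in re.findall(r'\b(?<!\[)[A-Z ]{2,}(?!\])\b', string)]
print(stripped)  # ['JURASSIC WORLD', 'BATMAN']

# Option 2 (assumed variant): runs of uppercase words joined by single spaces,
# still rejecting anything directly after "[" or directly before "]".
tight = re.findall(r'(?<!\[)\b[A-Z]{2,}(?: [A-Z]{2,})*\b(?!\])', string)
print(tight)  # ['JURASSIC WORLD', 'BATMAN']
```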
>>>

Tokenizing first and last name as one token

Is it possible to tokenize a text into tokens such that the first and last name are combined into one token?
For example if my text is:
text = "Barack Obama is the President"
Then:
text.split()
results in:
['Barack', 'Obama', 'is', 'the', 'President']
How can I recognize the first and last name, so that I get only ['Barack Obama', 'is', 'the', 'President'] as tokens?
Is there a way to achieve it in Python?
What you are looking for is a named entity recognition system. I suggest you do not consider this as part of tokenization.
For python you can use https://pypi.python.org/pypi/ner/
Example from the site
>>> tagger.json_entities("Alice went to the Museum of Natural History.")
'{"ORGANIZATION": ["Museum of Natural History"], "PERSON": ["Alice"]}'
Here's a regular expression that meets the needs of your question. It will find individual words beginning with a lowercase character, or match singleton or pairs of capitalized words.
import re
re.findall(r"[a-z]\w+|[A-Z]\w+(?: [A-Z]\w+)?",text)
outputs
['Barack Obama', 'is', 'the', 'President']
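A short sketch of where this kind of regex works and where it breaks down. The pattern below is a slight variant (using \w* instead of \w+ so that one-letter words are kept), and the second sentence is a hypothetical example.

```python
import re

# Variant of the pattern above: \w* instead of \w+ keeps one-letter words.
pattern = re.compile(r"[a-z]\w*|[A-Z]\w*(?: [A-Z]\w*)?")

first = pattern.findall("Barack Obama is the President")
print(first)   # ['Barack Obama', 'is', 'the', 'President']

# Hypothetical counterexample: any two adjacent capitalized words are merged,
# whether or not they form a name.
second = pattern.findall("The President is Barack Obama")
print(second)  # ['The President', 'is', 'Barack Obama']
```

The second result shows why capitalization alone is only a heuristic, and why a named entity recognition system is the more robust choice for real text.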
