My question is regarding the following tweets:
Credit Suisse Trims Randgold Resources Limited (RRS) Target Price to GBX
JPMorgan Chase & Co Trims Occidental Petroleum Co (OXY) Target Price to
I want to remove "Randgold Resources Limited (RRS)" from the first tweet and "Occidental Petroleum Co (OXY)" from the second tweet using Regex.
I am working in Python and so far I have tried this without much luck:
Trims\s[\w\s.()]+(?=Target)
I want to capture the phrase "Trims Target Price" in both instances. Help would be appreciated.
You can use this lookaround based regex:
import re
p = re.compile(r'(?<= Trims) .*?(?= Target )')
result = re.sub(p, "", test_str)
(?<= Trims) .*?(?= Target ) will match any text that is between Trims and Target.
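As a quick check, here is a minimal sketch that applies the pattern to the two tweets from the question. The names tweets and cleaned are illustrative only and not part of the original answer.

```python
import re

# The two tweets quoted in the question.
tweets = [
    "Credit Suisse Trims Randgold Resources Limited (RRS) Target Price to GBX",
    "JPMorgan Chase & Co Trims Occidental Petroleum Co (OXY) Target Price to",
]

# The lookbehind and lookahead only assert context; only the text
# between " Trims" and " Target " is matched and removed.
pattern = re.compile(r"(?<= Trims) .*?(?= Target )")

cleaned = [pattern.sub("", t) for t in tweets]
for line in cleaned:
    print(line)
# Credit Suisse Trims Target Price to GBX
# JPMorgan Chase & Co Trims Target Price to
```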
(?<=Trims )([A-Z][a-z]+ ){3}\([A-Z]{3}\)
The idea is:
(?<=Trims ) - find a place preceded by Trims using positive lookbehind
[A-Z][a-z]+ - a word starting with a capital letter followed by one or more lowercase letters
([A-Z][a-z]+ ){3} - three such words, each followed by a space
\( and \) - parentheses have to be escaped, otherwise they denote a capturing group
[A-Z]{3} - three capital letters
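The pieces above can be tried out with re.sub. This sketch makes two assumptions that are not in the answer: the group is written as non-capturing, and a trailing space is added to the pattern so that no double space is left behind. Note that the pattern only fits names of exactly three capitalized words followed by a three-letter ticker.

```python
import re

tweets = [
    "Credit Suisse Trims Randgold Resources Limited (RRS) Target Price to GBX",
    "JPMorgan Chase & Co Trims Occidental Petroleum Co (OXY) Target Price to",
]

# (?<=Trims )            position preceded by "Trims "
# (?:[A-Z][a-z]+ ){3}    three capitalized words, each followed by a space
# \([A-Z]{3}\)           a three-letter ticker in literal parentheses
# trailing space         assumption: also consume the space after the ticker
company = re.compile(r"(?<=Trims )(?:[A-Z][a-z]+ ){3}\([A-Z]{3}\) ")

stripped = [company.sub("", t) for t in tweets]
for line in stripped:
    print(line)
# Credit Suisse Trims Target Price to GBX
# JPMorgan Chase & Co Trims Target Price to
```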
Your pattern is missing the lookbehind assertion (?<=...), which matches only if the position is preceded by the given text, around the word Trims:
re.sub(r'(?<=Trims)\s[\w\s.()]+(?=Target)', ' ', text)
I'm trying to build a function that will collect an acronym using only regular expressions.
Example:
Data Science = DS
I'm trying to do 3 steps:
Find the first letter of each word
Translate every single letter to uppercase.
Group
Unfortunately I get errors.
I repeat that I need to use the regular expression functionality.
Regular expression for creating an acronym.
some_words = 'Data Science'
all_words_select = r'(\b\w)'
word_upper = re.sub(all_words_select, some_words.upper(), some_words)
print(word_upper)
result:
DATA SCIENCEata DATA SCIENCEcience
Why is the text duplicated?
I plan to get: DATA SCIENCE
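A short sketch of why the output is duplicated: some_words.upper() is evaluated once, before re.sub runs, so every match of (\b\w) is replaced by the full string 'DATA SCIENCE'. The variable names replacement and matches are illustrative.

```python
import re

some_words = "Data Science"

# The replacement is a fixed string, computed once up front.
replacement = some_words.upper()  # 'DATA SCIENCE'

# The pattern matches the first character of each word.
matches = re.findall(r"(\b\w)", some_words)
print(matches)  # ['D', 'S']

# Each of those two matches is replaced by the whole replacement string,
# while the remaining characters ("ata ", "cience") are left untouched.
result = re.sub(r"(\b\w)", replacement, some_words)
print(result)  # DATA SCIENCEata DATA SCIENCEcience
```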
You don't need regex for the problem you have stated. You can just split the words on space, then take the first character and convert it to the upper case, and finally join them all.
>>> ''.join(w[0].upper() for w in some_words.split(' '))
'DS'
You need to handle special cases, such as a word that starts with a non-alphabetic character, with something like if w[0].isalpha().
Another approach, using re.sub and a negative lookbehind:
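A minimal sketch of that guard, wrapped in a hypothetical helper named acronym. It assumes split() with no argument instead of split(' '), so that repeated spaces do not produce empty words.

```python
def acronym(text):
    """Join the uppercased first letters of the words in text.

    Words that do not start with an alphabetic character are skipped.
    """
    return "".join(w[0].upper() for w in text.split() if w[0].isalpha())


print(acronym("Data Science"))        # DS
print(acronym("data  science 101"))   # DS
```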
>>> re.sub(r'(?<!\b).|\s','', some_words)
'DS'
Use
import re
some_words = 'Data Science'
all_words_select = r'\b(?![\d_])(\w)|.'
word_upper = re.sub(all_words_select, lambda z: z.group(1).upper() if z.group(1) else '', some_words, flags=re.DOTALL)
print(word_upper)
EXPLANATION
Match a letter at the word beginning => capture (\b(?![\d_])(\w))
Else, match any character (|.)
Whenever capture is not empty replace with a capital variant (z.group(1).upper())
Else, remove the match ('').
Pattern:
\b        the boundary between a word char (\w) and something that is not a word char
(?!       look ahead to see if there is not:
  [\d_]     any character of: digits (0-9), '_'
)         end of look-ahead
(         group and capture to \1:
  \w        word characters (a-z, A-Z, 0-9, _)
)         end of \1
|         OR
.         any character (including \n here, because re.DOTALL is passed)
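The same pattern can be wrapped in a small function to see how the lookahead and the fallback alternative behave on other inputs. The function name make_acronym and the extra test strings are assumptions for illustration.

```python
import re

# First alternative: a word-initial character that is not a digit or "_",
# captured in group 1. Second alternative: any other single character.
ACRONYM_RE = re.compile(r"\b(?![\d_])(\w)|.", flags=re.DOTALL)


def make_acronym(text):
    # Keep (uppercased) only what group 1 captured; drop everything else.
    return ACRONYM_RE.sub(
        lambda m: m.group(1).upper() if m.group(1) else "", text
    )


print(make_acronym("Data Science"))          # DS
print(make_acronym("machine learning 101"))  # ML
print(make_acronym("data\nscience"))         # DS
```

Words that start with a digit contribute nothing, because the lookahead rejects them and the fallback alternative then removes them character by character.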
I need some help on declaring a regex. My inputs are like the following:
I need to find a word and the word before it, and join them with "_" using a regex in Python.
Input
s2 = 'Some other medical terms and stuff diagnosis of R45.2 was entered for this patient. Where did Doctor Who go? Then xxx feea fdsfd'
# my regex pattern
re.sub(r"(?:[a-zA-Z'-]+[^a-zA-Z'-]+){0,1}diagnosis", r"\1_", s2)
Desired Output:
s2 = 'Some other medical terms and stuff_diagnosis of R45.2 was entered for this patient. Where did Doctor Who go? Then xxx feea fdsfd'
You have no capturing group defined in your regex, but are using \1 placeholder (replacement backreference) to refer to it.
You want to replace 1+ special chars other than - and ' before the word diagnosis, thus you may use
re.sub(r"[^\w'-]+(?=diagnosis)", "_", s2)
Details
[^\w'-]+ - one or more characters other than word characters, ' and -
(?=diagnosis) - a positive lookahead that does not consume the text (does not add to the match value and thus re.sub does not remove this piece of text) but just requires diagnosis text to appear immediately to the right of the current location.
Or
re.sub(r"[^\w'-]+(diagnosis)", r"_\1", s2)
Here, [^\w'-]+ also matches those special chars, but (diagnosis) is a capturing group whose text can be referred to using the \1 placeholder from the replacement pattern.
NOTE: If you want to make sure diagnosis is matched as a whole word, use \b around it, \bdiagnosis\b (mind the r raw string literal prefix!).
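A short sketch that runs both variants against the input from the question, with the whole-word \b boundaries from the note applied. The names expected, with_lookahead and with_group are illustrative.

```python
import re

s2 = ("Some other medical terms and stuff diagnosis of R45.2 was entered "
      "for this patient. Where did Doctor Who go? Then xxx feea fdsfd")
expected = ("Some other medical terms and stuff_diagnosis of R45.2 was entered "
            "for this patient. Where did Doctor Who go? Then xxx feea fdsfd")

# Variant 1: the lookahead keeps "diagnosis" out of the match,
# so only the separator is replaced.
with_lookahead = re.sub(r"[^\w'-]+(?=\bdiagnosis\b)", "_", s2)

# Variant 2: "diagnosis" is consumed but captured, then restored via \1.
with_group = re.sub(r"[^\w'-]+(\bdiagnosis\b)", r"_\1", s2)

print(with_lookahead == expected)  # True
print(with_group == expected)      # True
```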
Input is a two-sentence string:
s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'
I'd like to .split s into sentences based on the logic that:
sentences end with one or more periods, exclamation marks, question marks, or period+quotation mark
and are then followed by 1+ whitespace characters and a capitalized alpha character.
Desired result:
['Sentence 1 here.', 'This sentence contains 1 fl. oz. but is one sentence.']
Also okay:
['Sentence 1 here', 'This sentence contains 1 fl. oz. but is one sentence.']
But I currently chop off the 0th element of each sentence because the uppercase character is captured:
import re
END_SENT = re.compile(r'[.!?(.")]+[ ]+[A-Z]')
print(END_SENT.split(s))
['Sentence 1 here', 'his sentence contains 1 fl. oz. but is one sentence.']
Notice the missing T. How can I tell .split to ignore certain elements of the compiled pattern?
((?<=[.!?])|(?<=\.\")) +(?=[A-Z])
Although I would suggest the version below, which also treats a quotation mark preceded by any of . ! ? as a split point:
((?<=[.!?])|(?<=[.!?]\")) +(?=[A-Z])
Explanation
The part common to both solutions is ' +(?=[A-Z])':
' +' #One or more spaces (the actual splitting chars used)
(?= #START positive look ahead: check that this follows, but do not consume it
[A-Z] #Any uppercase letter
) #END positive look ahead
The conditions for what comes before the space
For Solution1
( #GROUP START
(?<= #START Positive look behind, Make sure this comes before but do not consume
[.!?] #any one of these chars should come before the splitting space
) #END positive look behind
| #OR condition this is also the reason we had to put all this in GROUP
(?<= #START Positive look behind,
\.\" #splitting space could be preceded by .", covering a case not handled by the previous set of . or ! or ?
) #END positive look behind
) #END GROUP
For Solution2
( #GROUP START
(?<=[.!?]) #Same as the previous look behind
| #OR condition
(?<=[.!?]\") #Only difference here is that we are allowing quote after any of . or ! or ?
) #GROUP END
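One caveat when passing these patterns to re.split: the text of capturing groups is included in the result, so the outer ( ... ) adds an empty string between sentences. The sketch below assumes a non-capturing group (?: ... ) instead; the variable names are illustrative.

```python
import re

s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'

# Solution 2 as written: the capturing group matches an empty string,
# and re.split returns that group text as an extra list element.
with_capture = re.split(r'((?<=[.!?])|(?<=[.!?]")) +(?=[A-Z])', s)
print(with_capture)
# ['Sentence 1 here.', '', 'This sentence contains 1 fl. oz. but is one sentence.']

# Same logic with a non-capturing group: only the sentences are returned.
splitter = re.compile(r'(?:(?<=[.!?])|(?<=[.!?]")) +(?=[A-Z])')
sentences = splitter.split(s)
print(sentences)
# ['Sentence 1 here.', 'This sentence contains 1 fl. oz. but is one sentence.']

# The second lookbehind covers a closing quote after the punctuation.
quoted = splitter.split('He said "Stop." Then he left.')
print(quoted)
# ['He said "Stop."', 'Then he left.']
```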
It's easier to describe the sentence than trying to identify the delimiter. So instead of re.split try with re.findall:
re.findall(r'([^.?!\s].*?[.?!]*)\s*(?![^A-Z])', s)
To preserve the next uppercase letter, the pattern uses a lookahead that is only a test and doesn't consume characters.
details:
( # capture group: re.findall return only the capture group content if any
[^.?!\s] # the first character isn't a space or a punctuation character
.*? # a non-greedy quantifier
[.?!]* # eventual punctuation characters
)
\s* # zero or more white-spaces
(?![^A-Z]) # not followed by a character that isn't an uppercase letter
# (this includes an uppercase letter and the end of the string)
Obviously, for more complicated cases with abbreviations, names, etc., you have to use tools like nltk or any other nlp tools trained with dictionaries.
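For reference, a minimal runnable version of the re.findall approach on the string from the question. The variable name sentences is illustrative.

```python
import re

s = 'Sentence 1 here. This sentence contains 1 fl. oz. but is one sentence.'

# The capture group describes one sentence; the trailing whitespace and the
# lookahead sit outside the group, so they are not part of the result.
sentences = re.findall(r'([^.?!\s].*?[.?!]*)\s*(?![^A-Z])', s)
print(sentences)
# ['Sentence 1 here.', 'This sentence contains 1 fl. oz. but is one sentence.']
```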
I want a regex in Python which extracts one or more consecutive words starting with capital letters, unless the word is the first word of a sentence. I know it's not a robust and consistent method but it'll solve my problem as I don't want to use any statistical method (e.g. as in NLTK or StanfordNER).
Examples:
extract('His name is John Wayne.')
should return ['John Wayne'].
extract('He is The President of Neverland.')
should return ['The President', 'Neverland'] because they are capitalized words and they don't occur at the beginning of a sentence.
another example:
extract('He came home. Although late, it was nice to have Patrick there.')
should return ['Patrick'] because 'He' and 'Although' occur at the beginning of a sentence.
Also it could drop punctuation for example 'He was John, who came' should return 'John' and not 'John,'.
You can use this expression for this task:
(?<!\.\s)(?!^)\b([A-Z]\w*(?:\s+[A-Z]\w*)*)
RegEx Breakup:
(?<!\.\s) - Negative lookbehind to assert we don't have a DOT and space before
(?!^) - Negative lookahead to assert we are not at start
\b - Word boundary
( - Start capturing group
[A-Z]\w* - Match a word starting with a capital letter
(?: - Start non-capturing group
\s+ - Match 1 or more whitespaces
[A-Z]\w* - Match a capital letter word
)* - End non-capturing group. Match 0 or more of these
) - End capturing group
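The pattern can be dropped into the extract function from the question roughly as follows. This is a sketch, not a definitive implementation: the lookbehind only recognizes a period followed by one whitespace character, so sentences that end in ! or ? are not treated as boundaries.

```python
import re

# Runs of capitalized words that are neither at the start of the string
# nor directly after ". ".
CAPITALIZED_RE = re.compile(r'(?<!\.\s)(?!^)\b([A-Z]\w*(?:\s+[A-Z]\w*)*)')


def extract(text):
    # With a single capturing group, findall returns the group contents.
    return CAPITALIZED_RE.findall(text)


print(extract('His name is John Wayne.'))
# ['John Wayne']
print(extract('He is The President of Neverland.'))
# ['The President', 'Neverland']
print(extract('He came home. Although late, it was nice to have Patrick there.'))
# ['Patrick']
print(extract('He was John, who came'))
# ['John']
```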
I have the following regex:
res = re.finditer(r'(?:\w+[ \t,]+){0,4}my car',txt,re.IGNORECASE|re.MULTILINE)
for item in res:
    print(item.group())
When I use this regex with the following string:
"my house is painted white, my car is red.
A horse is galloping very fast in the road, I drive my car slowly."
I am getting the following results:
house is painted white, my car
the road, I drive my car
My question is about the quantifier {0,4} that should apply to the whole group. The group collects words with the expression \w+ and some separation symbols with the [ ]. Does the quantifier apply only to the "words" defined by \w+? In the results I am getting 4 words plus space and comma. It's unclear to me.
Here's what's happening. (?: ... ) is a non-capturing group. Inside it, \w+ matches one word (one or more word characters) and [ \t,]+ matches one or more of the separators that follow it (space, tab, or comma). {0,4} repeats that whole group between 0 and 4 times. So the regex finds "my car" and matches up to 4 words before it; each repetition consumes one word together with its trailing separators, which is why the comma and spaces appear in the result without counting as extra words.
Broken apart more succinctly
(?: -- Non capturing group
\w+ -- One or more word characters (one word)
[ \t,]+ -- One or more spaces, commas, or tab characters
) -- End non-capturing group
{0,4} -- Match the previous non-capturing group 0-4 times
my car -- Based off where you find the words "my car"
As a result, this matches up to four words, each with its trailing spaces, commas, or tabs, before the appearance of "my car".
The regex is working as written.
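To see that the quantifier counts repetitions of the whole group (a word plus its separators), here is a small sketch that varies the upper bound. The helper context_before and its max_words parameter are hypothetical; re.MULTILINE is left out because the pattern has no ^ or $ anchors.

```python
import re

txt = ("my house is painted white, my car is red.\n"
       "A horse is galloping very fast in the road, I drive my car slowly.")


def context_before(text, max_words):
    # Each repetition of the group is one word plus its trailing separators.
    pattern = r"(?:\w+[ \t,]+){0,%d}my car" % max_words
    return [m.group() for m in re.finditer(pattern, text, re.IGNORECASE)]


print(context_before(txt, 4))
# ['house is painted white, my car', 'the road, I drive my car']
print(context_before(txt, 1))
# ['white, my car', 'drive my car']
```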