Extracting words next to a location or duration in Python

How can I extract the words next to a location or duration? What is the best regex in Python for this?
Example:
Karthick Kumar, Bangalore who was a great person and lived from 29th March 1980 - 21 Dec 2014.
In the above example I want to extract the words before the location and the words before the duration. The location and duration are not fixed. What would be the best regex for this in Python? Or can this be done with nltk?
Desired output:-
Output-1: Karthick Kumar (Keyword here is Location)
Output-2: who was a great person and lived from (Keyword here is duration)

I suggest using Lookaheads.
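In your example, assuming you want the words before Bangalore and before 29th March 1980 - 21 Dec 2014, you could use lookaheads (and lookbehinds) to get the relevant match.
I've used this regex: (.*)(?>Bangalore)(.+)(?=29th March 1980 - 21 Dec 2014) and captured the text in parentheses, which can be accessed by using \1 and \2. Note that (?>...) is an atomic group, which Python's re module only supports from version 3.11; on older versions (?:Bangalore) or the bare literal behaves the same here.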
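A minimal Python sketch of the same idea, assuming the location and duration strings are already known (the question says they vary, so in practice they would have to come from a separate step that is not shown here). The function name `words_before` is hypothetical. It uses lazy groups and `re.escape` rather than an atomic group, so it does not depend on Python 3.11.

```python
import re

def words_before(text, location, duration):
    """Return (words before the location, words before the duration)."""
    pattern = re.compile(
        r"(.*?)" + re.escape(location)            # group 1: text before the location
        + r"(.+?)(?=" + re.escape(duration) + r")"  # group 2: text up to the duration
    )
    m = pattern.search(text)
    if m is None:
        return None
    # Trim whitespace and the comma that separates the name from the location.
    return m.group(1).strip(" ,"), m.group(2).strip()

sentence = ("Karthick Kumar, Bangalore who was a great person and lived "
            "from 29th March 1980 - 21 Dec 2014.")
print(words_before(sentence, "Bangalore", "29th March 1980 - 21 Dec 2014"))
# ('Karthick Kumar', 'who was a great person and lived from')
```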

Related

Python Regex to extract meeting invite from Gmail Subject

I'm trying to extract the meeting date / time from meeting invites within Gmail's subject. Below is an example of a subject for a meeting invite:
Invitation: Bob / Carol Meeting # Tue Oct 25, 2022 11:30am - 12pm (CST) (bob#example.org)
What I would like to extract:
Tue Oct 25, 2022 11:30am - 12pm (CST)
I think the pattern could simply start with the space after the "#" and end with the ")". My Regex is very rusty so would appreciate any help :)
Many thanks!
Try this - it should match everything after the "# " and up to the end of the timezone ")"
import re
string = (
'Invitation: Bob / Carol Meeting # Tue Oct 25, 2022 11:30am - 12pm (CST) (bob#example.org)'
)
pattern = re.compile(r'(?<=# )[^)]+\)')
matches = re.findall(pattern, string)
print(matches)
# => ['Tue Oct 25, 2022 11:30am - 12pm (CST)']
Bear in mind that re.findall returns a list of matches, which is helpful if you want to scan a long multiline string of text and get all the matches at once. If you only care about the first match, you can get it by index, e.g. print(matches[0]).
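If only the first match matters, re.search is an alternative to indexing the list. A short sketch using the same pattern; the variable names `subject` and `when` are only illustrative:

```python
import re

subject = ('Invitation: Bob / Carol Meeting # Tue Oct 25, 2022 '
           '11:30am - 12pm (CST) (bob#example.org)')

# re.search stops at the first match and returns a match object, or None.
match = re.search(r'(?<=# )[^)]+\)', subject)
when = match.group(0) if match else None
print(when)
# Tue Oct 25, 2022 11:30am - 12pm (CST)
```

Checking for None avoids the IndexError that matches[0] would raise on a subject with no date.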
It looks like you don't technically need regex for this.
Try the following:
>>> s = 'Invitation: Bob / Carol Meeting # Tue Oct 25, 2022 11:30am - 12pm (CST) (bob#example.org)'
>>> s[s.index('#') + 1 : s.rindex('(')].strip()
'Tue Oct 25, 2022 11:30am - 12pm (CST)'

Using RegEx in Python to extract contents

Good evening,
I am very new to Python and RegEx. I have the following sentence:
-75.76 Card INSURANCEGrabPay ASIA DIRECT to Paid AM 1:16 +100.00 3257 UpAmex Top PM 9:55 +300.00 3257 UpAmex Top PM 9:55 -400.00 Card LTDGrabPay PTE AXS to Paid PM 9:57 (SGD) Amount Details Time here. appear will transactions cashless your All 2022 Feb 15 on made transactions GrabPay points 52 earned points Rewards 475.76 SGD spent Amount 0.24 SGD balance Wallet 2022 Feb 15 Summary statement daily your here
I would like to search for just '-' and the amount after it.
After that, I would like to skip 2 words and extract all the following words, in a single group if need be, up to just before 'Paid'. (I will read more about groups, but for now I need a single group, which I can later split to get the individual words.)
For instance, I would get
-75.76 ASIA Direct to
-400 PTE AXS to
What would be the regex command? Also, is there a good regex tutorial that I can read up on?
For now I have created one match with 2 groups, i.e. group 1 for the amount and group 2 for all the words (including the trailing "to " string).
Regex:
(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid
You can check the details here: https://regex101.com/r/eUMgdW/1
Python code:
import re
# Raw string so that \d and \w reach the regex engine unchanged.
output = re.findall(r"(-\d+\.?\d+) \w+ \w+ ([\w ]+)?Paid", your_input_string)
for found in output:
    print(found)
# ('-75.76', 'ASIA DIRECT to ')
# ('-400.00', 'PTE AXS to ')
Rather than give you the actual regex, I'll gently nudge you in the right direction. It's more satisfying that way.
"Words" here are separated by spaces. So what you're searching for is a group of characters (captured), a space, characters again, a space, characters, a space, then capture everything and end with "Paid". Try to create a regex to do that.
If you'd like to brush up on regex, check out Regex101. It's a web tool to test out regex, along with a debugger and a cheat sheet.

How to find the words that correspond to months and replace them with numbers?

How can I find the words that correspond to months ("January, February, March, ...") and replace them with numbers ("01, 02, 03, ...")?
I tried the code below:
def transformMonths(string):
    rep = [("May", "05"), ("June", "06")]
    for pat, repl in rep:
        s = re.sub(pat, repl, string)
    return s
print( transformMonths('I was born on June 24 and my sister was born on May 17') )
My code provides this result ('I was born on 06 24 and my sister was born on May 17')
However, I want the output to be like this ('I was born on 06 24 and my sister was born on 05 17')
You are performing the replacement on the initial (unmodified) string at each iteration so you end up with only one month name being replaced. You can fix that by assigning string instead of s in the loop (and return string at the end).
Note that your approach does not require a regular expression and could use a simple string replace: string = string.replace(pat,repl).
In both cases, because the replacement does not take into account word boundaries, the function would replace partial words such as:
"Mayor Smith was elected on May 25" --> "05or Smith was elected on 05 25".
You can fix that in your regular expression by adding \b before and after each month name. This will ensure that the month names are only found if they are between word boundaries.
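A minimal sketch of the loop with both fixes applied (reassigning the string on each pass and wrapping each name in \b). The name `transform_months` and the two-entry list are only for illustration:

```python
import re

def transform_months(text):
    replacements = [("May", "05"), ("June", "06")]
    for name, number in replacements:
        # Reassign text so each pass builds on the previous one;
        # \b keeps "May" from matching inside "Mayor".
        text = re.sub(r"\b" + re.escape(name) + r"\b", number, text)
    return text

print(transform_months("Mayor Smith was elected on May 25"))
# Mayor Smith was elected on 05 25
print(transform_months("I was born on June 24 and my sister was born on May 17"))
# I was born on 06 24 and my sister was born on 05 17
```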
re.sub can perform multiple replacements with varying values if you give it a function instead of a fixed string. So you can build a combined regular expression that will find all the months and replace the words that are found using a dictionary:
import re
def numericMonths(string):
    months = {"January":"01", "February":"02", "March":"03", "April":"04",
              "May":"05", "June":"06", "July":"07", "August":"08",
              "September":"09", "October":"10", "November":"11", "December":"12"}
    pattern = r"\b("+"|".join(months)+r")\b" # all months as distinct words
    return re.sub(pattern, lambda m: months[m.group()], string)
output:
numericMonths('I was born on June 24 and my sister was born on May 17')
'I was born on 06 24 and my sister was born on 05 17'

regex to remove certain pattern matching of data in python [duplicate]

Can someone help me with the scenario below?
Input: [14][15] In May 2016, she was one of the 12 candidates nominated by the BJP[16][17] to contest the Rajya Sabha elections due on 11 June 2016.[20]
Output: In May 2016, she was one of the 12 candidates nominated by the BJP to contest the Rajya Sabha elections due on 11 June 2016.
I'm working on a project where I crawl Wikipedia to fetch data. The problem is that the data comes in the above format. I need a regex pattern that removes the numbers inside [] wherever they occur. It should not remove other numbers.
import re
text = ('[14][15] In May 2016, she was one of the 12 candidates nominated by the '
        'BJP[16][17] to contest the Rajya Sabha elections due on 11 June 2016.[20]')
# Remove bracketed numbers, then trim the space left at the start.
text = re.sub(r'\[\d+\]', '', text).strip()
print(text)
output
In May 2016, she was one of the 12 candidates nominated by the BJP to contest the Rajya Sabha elections due on 11 June 2016.
You can test your own regular expressions here https://regex101.com/
You can try this
import re
text = "[14][15] In May 2016, she was one of the 12 candidates nominated by the BJP[16][17] to contest the Rajya Sabha elections due on 11 June 2016.[20]"
# Note: this removes anything in square brackets, not only numbers.
pattern = r'\[[^\]]*\]'
line = re.sub(pattern, '', text).strip()
print(line)
Result
In May 2016, she was one of the 12 candidates nominated by the BJP to contest the Rajya Sabha elections due on 11 June 2016.
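The two patterns in these answers are not equivalent: \[\d+\] removes only bracketed numbers, while \[[^\]]*\] removes any bracketed text. A small sketch of the difference on a made-up sample string (not from the question):

```python
import re

# Hypothetical sample with a non-numeric bracket to show the difference.
sample = "BJP[16][17] won [citation needed] 12 seats.[20]"

digits_only = re.sub(r"\[\d+\]", "", sample)
any_brackets = re.sub(r"\[[^\]]*\]", "", sample)

print(digits_only)   # BJP won [citation needed] 12 seats.
print(any_brackets)  # BJP won  12 seats.
```

Since the question asks to remove only numbers inside [], the digits-only pattern is the closer fit.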

Unexpected result in regex - what am I missing?

I am trying to extract immunization records of this form:
Immunization: Tetanus
Other: Booster
Method: Injection
Date Received: 07 Jan 2013
and also of this form:
Immunization: TETANUS DIPTHERIA (TD-ADULT)
Date Received: 07 Dec 2012 # 1155
Location: PORTLAND (OR) VAMC
Reaction:* None Reported
Comments: 1234567
Here is my pattern string:
"Immunization:(.*?)\n[.\n*?]*?Date Received:(.*?)\n"
This identifies the second pattern and extracts the vaccination name and date, but not the first pattern. I thought that [.\n*?]*? would take care of the two possibilities (that there are other fields between vaccination name and vaccination date, or not), but this doesn't seem to be doing the trick. What is wrong with my regex and how can I fix it?
You can use:
import re
matches = re.findall(r"Immunization:\s+(.*?)\s+.*?Date Received:\s+(.*?)$", subject, re.IGNORECASE | re.DOTALL | re.MULTILINE)
Tested this on pythex with MULTILINE and DOTALL:
Input
Immunization: Tetanus
Other: Booster
Method: Injection
Date Received: 07 Jan 2013
Immunization: TETANUS DIPTHERIA (TD-ADULT)
Date Received: 07 Dec 2012 # 1155
Location: PORTLAND (OR) VAMC
Reaction:* None Reported
Comments: 1234567
Pattern: Immunization:\s+(\w+).*?Date Received:\s+([^\n]+)
Match 1
Tetanus
07 Jan 2013
Match 2
TETANUS
07 Dec 2012 # 1155
The . in [.\n*?] is taken as a literal '.', not as a symbol for any character (likewise, * and ? are literals inside the brackets). This is why a date line immediately following the immunization line is accepted, but the pattern cannot jump across any character that is not a dot, a newline, * or ?.
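A quick check of how Python's re module treats these characters inside a character class:

```python
import re

# Inside [...], "." matches only a literal dot, not "any character".
dot_in_class = re.findall(r"[.]", "a.b")
dot_outside = re.findall(r".", "a.b")
print(dot_in_class)  # ['.']
print(dot_outside)   # ['a', '.', 'b']

# So [.\n*?] is just the set {".", newline, "*", "?"} and cannot
# consume an intermediate line such as "Other: Booster".
skips_field = re.fullmatch(r"[.\n*?]*", "Other: Booster\n") is not None
skips_symbols = re.fullmatch(r"[.\n*?]*", "..\n*?") is not None
print(skips_field)    # False
print(skips_symbols)  # True
```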
(.*\n)* comes to mind as the closest fix to what you already have. However, so many nested * are unfortunate: the engine may have to backtrack a lot while parsing the record, and as a human I also find it harder to understand. It may be preferable to start every loop with a literal, which helps decide whether the loop should be entered or continued at all.
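If I did not mess it up, then
Immunization:(.*?)(?:\n.*)*?\nDate Received:(.*)\n
should work: each pass through the loop has to start with a literal newline, "Date Received" is only detected at the beginning of a line, and the lazy *? stops at the first "Date Received" line rather than the last one when the text holds several records.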
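A sketch of the same line-by-line idea, run against the two sample records from the question. It assumes each record starts with an "Immunization:" line and contains a "Date Received:" line; the variable names are illustrative. The loop over intermediate lines is lazy and non-capturing, so each name pairs with the nearest following date and findall returns only the two wanted groups.

```python
import re

records = """Immunization: Tetanus
Other: Booster
Method: Injection
Date Received: 07 Jan 2013
Immunization: TETANUS DIPTHERIA (TD-ADULT)
Date Received: 07 Dec 2012 # 1155
Location: PORTLAND (OR) VAMC
Reaction:* None Reported
Comments: 1234567
"""

pattern = re.compile(
    r"^Immunization:[ \t]*(.*)\n"   # vaccine name: rest of the line
    r"(?:.*\n)*?"                   # skip as few whole lines as possible
    r"Date Received:[ \t]*(.*)$",   # date: rest of the line
    re.MULTILINE,
)

results = pattern.findall(records)
for name, date in results:
    print(name, "|", date)
# Tetanus | 07 Jan 2013
# TETANUS DIPTHERIA (TD-ADULT) | 07 Dec 2012 # 1155
```

One caveat of this sketch: a record with no "Date Received:" line would pick up the date of the next record.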
