regex to remove certain pattern matching of data in python [duplicate] - python

Can someone help me with the scenario below?
Input: [14][15] In May 2016, she was one of the 12 candidates nominated by the BJP[16][17] to contest the Rajya Sabha elections due on 11 June 2016.[20]
Output: In May 2016, she was one of the 12 candidates nominated by the BJP to contest the Rajya Sabha elections due on 11 June 2016.
I'm working on a project where I crawl Wikipedia to fetch data. The problem is that the data arrives in the format above. I need a regex pattern that removes the numbers enclosed in [] wherever they occur. It should not remove any other numbers.

import re
text = ('[14][15] In May 2016, she was one of the 12 candidates nominated by the '
        'BJP[16][17] to contest the Rajya Sabha elections due on 11 June 2016.[20]')
# Remove every bracketed number, then trim the space left at the start.
text = re.sub(r'\[\d+\]', '', text).strip()
print(text)
Output:
In May 2016, she was one of the 12 candidates nominated by the BJP to contest the Rajya Sabha elections due on 11 June 2016.
You can test your own regular expressions at https://regex101.com/

You can try this:
import re
text = "[14][15] In May 2016, she was one of the 12 candidates nominated by the BJP[16][17] to contest the Rajya Sabha elections due on 11 June 2016.[20]"
# Matches any text between square brackets, not only numbers.
pattern = r'\[[^\]]*\]'
line = re.sub(pattern, '', text).strip()
print(line)
Result:
In May 2016, she was one of the 12 candidates nominated by the BJP to contest the Rajya Sabha elections due on 11 June 2016.
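The two patterns above are not interchangeable, which matters if the crawled text contains brackets that are not citation markers. Here is a minimal sketch of the difference; the sample sentence is shortened and the "[note]" marker is a hypothetical addition, not part of the original data.

```python
import re

text = ("[14][15] In May 2016, she was nominated by the BJP[16][17] "
        "(see [note]) on 11 June 2016.[20]")

# Removes only bracketed runs of digits, such as [14] or [20].
digits_only = re.sub(r"\[\d+\]", "", text).strip()

# Removes anything between square brackets, including [note].
any_brackets = re.sub(r"\[[^\]]*\]", "", text).strip()

print(digits_only)
print(any_brackets)
```

If only numeric citation markers should go, the stricter \[\d+\] pattern is probably the safer choice.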

Related

Delete all the digits from a string except the digits that are followed by given letter using re.sub() in python3 [closed]

I know how to remove all digits from a string using re.sub(). But I don't know how to remove all digits from a string except some special ones.
For example, let's say I have the string below:
"On 11 December 2008, India entered the 3G arena"
And I want the output as:
"On December, India entered the 3G arena"
You can use a negative lookahead, (?!...), to ensure that the character following the digit is not one of the letters you specify.
Here is an example in which digits followed by G, J, or K are left untouched:
import re
print(re.sub(r"\d(?![GJK])", "", "On 11 December 2008, India entered the 3G arena 1A 3J 5K"))
# On  December , India entered the 3G arena A 3J 5K
You can use \b (word boundary) to delete numbers that are not part of a word, like this:
import re
txt = "On 11 December 2008, India entered the 3G arena"
cleaned = re.sub(r'\b \d+\b','',txt)
print(cleaned)
Output:
On December, India entered the 3G arena
Note the space before \d+; without it you would end up with doubled spaces. This solution assumes that the digits to remove always follow a space. If that does not hold, you can use r'\b\d+\b' and then remove the superfluous spaces.
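The second route mentioned above (plain r'\b\d+\b' followed by a whitespace cleanup) could look roughly like this. It is a sketch only; the cleanup rules (collapse repeated spaces, drop a space before punctuation) are my assumption about what "superfluous" means here.

```python
import re

txt = "On 11 December 2008, India entered the 3G arena"

# Step 1: drop standalone numbers. "3G" survives because there is no
# word boundary between "3" and "G".
no_numbers = re.sub(r"\b\d+\b", "", txt)

# Step 2: collapse runs of spaces, then remove a space left before punctuation.
collapsed = re.sub(r" {2,}", " ", no_numbers)
cleaned = re.sub(r" ([,.;:])", r"\1", collapsed).strip()

print(cleaned)
```

This does not depend on the number being preceded by a space, at the cost of a second pass over the string.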
While azro's answer covers the general case, here's a solution that removes the numbers around month names:
import calendar
import re
# Alternation of the twelve month names: January|February|...|December
month_names = '|'.join([calendar.month_name[i] for i in range(1,13)])
s = "On 11 December 2008, India entered the 3G arena"
re.sub(fr"\d+\s+({month_names})\s+\d+", r"\1", s)
#'On December, India entered the 3G arena'

Python Regex - using re.sub to clean up a string

I am having some problems using regex sub to remove numbers from strings. Input strings can look like:
"The Term' means 125 years commencing on and including 01 October 2015."
"125 years commencing on 25th December 1996"
"the term of 999 years from the 1st January 2011"
What I want to do is remove the number and the word 'years'. I am also parsing the string for dates using DateFinder, but DateFinder interprets the number as a date, which is why I want to remove it.
Any thoughts on a regex that removes the number and the word 'years'?
I think this does what you want:
import re
my_list = [
    "The Term' means 125 years commencing on and including 01 October 2015.",
    "125 years commencing on 25th December 1996",
    "the term of 999 years from the 1st January 2011",
]
for item in my_list:
    # Raw string so that \d and \s reach the regex engine unchanged.
    new_item = re.sub(r"\d+\syears", "", item)
    print(new_item)
results:
The Term' means commencing on and including 01 October 2015.
commencing on 25th December 1996
the term of from the 1st January 2011
Note that you will end up with some extra whitespace (maybe you want that?). You could add this to clean it up:
new_item = re.sub(r"\s+", " ", new_item)
# Because I love regexes, trim the ends with one too:
new_item = re.sub(r"^\s+|\s+$", "", new_item)
# The plain string method does the same job:
new_item = new_item.strip()
Try this to remove the numbers and the word 'years':
re.sub(r'\s+\d+|\s+years', '', text)
For instance, if:
text = "The Term' means 125 years commencing on and including 01 October 2015."
then the output will be (note that the day and year of the date are removed as well):
"The Term' means commencing on and including October."

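Since the goal in this thread is to keep the dates for later parsing, one possible variation is to match the number, the word 'years', and the whitespace after it in a single pattern, so the dates stay intact and no extra spaces are left behind. This is a sketch under that assumption, not one of the posted answers.

```python
import re

items = [
    "The Term' means 125 years commencing on and including 01 October 2015.",
    "125 years commencing on 25th December 1996",
    "the term of 999 years from the 1st January 2011",
]

# Match a number, whitespace, the whole word "years", and any trailing
# whitespace, so no doubled or leading space is left behind.
duration = re.compile(r"\b\d+\s+years\b\s*")

cleaned = [duration.sub("", item) for item in items]
for line in cleaned:
    print(line)
```

Numbers that belong to a date are not followed by 'years', so they are not touched.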
Unexpected result in regex - what am I missing?

I am trying to extract immunization records of this form:
Immunization: Tetanus
Other: Booster
Method: Injection
Date Received: 07 Jan 2013
and also of this form:
Immunization: TETANUS DIPTHERIA (TD-ADULT)
Date Received: 07 Dec 2012 # 1155
Location: PORTLAND (OR) VAMC
Reaction:* None Reported
Comments: 1234567
Here is my pattern string:
"Immunization:(.*?)\n[.\n*?]*?Date Received:(.*?)\n"
This matches the second record and extracts the vaccination name and date, but it does not match the first record. I thought that [.\n*?]*? would handle both possibilities (that there are other fields between the vaccination name and the vaccination date, or not), but it doesn't seem to do the trick. What is wrong with my regex, and how can I fix it?
You can use:
import re
matches = re.findall(r"Immunization:\s+(.*?)\s+.*?Date Received:\s+(.*?)$", subject, re.IGNORECASE | re.DOTALL | re.MULTILINE)
Tested this on pythex with MULTILINE and DOTALL:
Input
Immunization: Tetanus
Other: Booster
Method: Injection
Date Received: 07 Jan 2013
Immunization: TETANUS DIPTHERIA (TD-ADULT)
Date Received: 07 Dec 2012 # 1155
Location: PORTLAND (OR) VAMC
Reaction:* None Reported
Comments: 1234567
Pattern: Immunization:\s+(\w+).*?Date Received:\s+([^\n]+)
Match 1
Tetanus
07 Jan 2013
Match 2
TETANUS
07 Dec 2012 # 1155
The . in [.\n] is taken as a literal '.', not as the any-character symbol. This is why a date line immediately following the immunization line is accepted, but the pattern cannot skip over any character that is neither a newline nor a dot.
(.*\n)* comes to mind as the closest fix to what you already have. However, having so many nested * quantifiers is unfortunate: it means a lot of work for the engine while parsing the record, and as a human I also find the pattern harder to understand. It may be preferable to start every repeated group with a literal, so that it is easy to decide whether the loop should be entered or continued at all.
If I did not mess it up, then
Immunization:(.*?)(\n.*)*\nDate Received:(.*)\n
would do so without left recursion, and "Date Received" would only be detected at the beginning of a line.
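To make the line-by-line idea concrete, here is a small sketch along the same lines. It is a variation rather than the exact pattern above: it uses a lazy, non-capturing group for the intermediate lines and re.MULTILINE so that ^ and $ anchor at line boundaries. The variable names are mine.

```python
import re

records = """Immunization: Tetanus
Other: Booster
Method: Injection
Date Received: 07 Jan 2013
Immunization: TETANUS DIPTHERIA (TD-ADULT)
Date Received: 07 Dec 2012 # 1155
Location: PORTLAND (OR) VAMC
Reaction:* None Reported
Comments: 1234567
"""

# Without re.DOTALL, "." never matches a newline, so each repetition of the
# middle group consumes exactly one whole line. The lazy "*?" skips as few
# lines as possible before "Date Received:".
pattern = re.compile(
    r"^Immunization:[ \t]*(.*)\n"   # vaccine name: rest of the line
    r"(?:.*\n)*?"                   # zero or more intermediate lines
    r"Date Received:[ \t]*(.*)$",   # date: rest of the line
    re.MULTILINE,
)

matches = pattern.findall(records)
print(matches)
```

One caveat: a record with no "Date Received" line would run on into the next record, so this assumes every immunization entry has a date.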

Extracting words next to a location or Duration in python

How can I extract the words next to a location or duration? What is the best regex for this in Python?
Example:-
Karthick Kumar, Bangalore who was a great person and lived from 29th March 1980 - 21 Dec 2014.
In the example above, I want to extract the words before the location and the words before the duration. The location and duration are not fixed. What would be the best regex for this in Python? Or can this be done with nltk?
Desired output:-
Output-1: Karthick Kumar (Keyword here is Location)
Output-2: who was a great person and lived from (Keyword here is duration)
I suggest using lookaheads.
In your example, assuming you want the words before Bangalore and before 29th March 1980 - 21 Dec 2014, you could use lookaheads (and lookbehinds) to get the relevant match.
I've used this regex: (.*)(?>Bangalore)(.+)(?=29th March 1980 - 21 Dec 2014) and captured the text in parentheses; the groups can be referenced as \1 and \2. Note that (?>Bangalore) is an atomic group rather than a lookaround, so it consumes "Bangalore", and Python's re module only accepts that syntax from version 3.11 on.
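For reference, this is roughly how the same idea looks in Python using only syntax that older versions of re also accept. It is a sketch: the location and duration are hard-coded here as hypothetical inputs, whereas in the real problem they would have to be detected first (for example with a named-entity tagger).

```python
import re

sentence = ("Karthick Kumar, Bangalore who was a great person and lived "
            "from 29th March 1980 - 21 Dec 2014.")

# Hypothetical, hard-coded keywords; a real pipeline would have to find them.
location = "Bangalore"
duration = "29th March 1980 - 21 Dec 2014"

# Group 1: text before the location (which is consumed).
# Group 2: text up to the duration, which the lookahead checks but does not consume.
pattern = re.compile(
    rf"(.*?){re.escape(location)}(.+?)(?={re.escape(duration)})"
)

match = pattern.search(sentence)
if match:
    before_location = match.group(1).strip(" ,")
    before_duration = match.group(2).strip()
    print(before_location)
    print(before_duration)
```

In Python code the captured text is read with match.group(1) and match.group(2); the \1 and \2 forms are for backreferences inside a pattern or a replacement string.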

Python string split without common delimiter

I am fairly new to Python. An external simulation software I use gives me reports which include data in the following format:
1    29 Jan 2013 07:33:19.273    29 Jan 2013 09:58:10.460          8691.186
I am looking to split the above data into four strings namely;
'1', '29 Jan 2013 07:33:19.273', '29 Jan 2013 09:58:10.460', '8691.186'
I cannot use str.split since it splits out the date into multiple strings. There appears to be four white spaces between 1 and the first date and between the first and second dates. I don't know if this is four white spaces or tabs.
Using '\t' as a delimiter on split doesn't do much. If I specify '    ' (4 spaces) as a delimiter, I get the first three strings, followed by an empty string and a final string with leading spaces. There are 10 spaces between the second date and the number.
Any suggestions on how to deal with this would be very helpful!
Thanks!
You can split on more than one space with a simple regular expression:
import re
multispace = re.compile(r'\s{2,}') # 2 or more whitespace characters
fields = multispace.split(inputline)
Demonstration:
>>> import re
>>> multispace = re.compile(r'\s{2,}') # 2 or more whitespace characters
>>> multispace.split('1    29 Jan 2013 07:33:19.273    29 Jan 2013 09:58:10.460          8691.186')
['1', '29 Jan 2013 07:33:19.273', '29 Jan 2013 09:58:10.460', '8691.186']
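If the fields are needed as typed values rather than strings, the split can be followed by a conversion step. The sketch below assumes the spacing described in the question (four spaces, then ten) and an English locale for the month abbreviation; the format string is my reading of the sample line, not something the report format guarantees.

```python
import re
from datetime import datetime

line = "1    29 Jan 2013 07:33:19.273    29 Jan 2013 09:58:10.460          8691.186"

# Split on runs of two or more whitespace characters (spaces or tabs).
index, start, stop, duration = re.split(r"\s{2,}", line.strip())

# %b parses abbreviated month names in the current locale ("Jan" in English).
fmt = "%d %b %Y %H:%M:%S.%f"
start_time = datetime.strptime(start, fmt)
stop_time = datetime.strptime(stop, fmt)

# Seconds between the two timestamps, for comparison with the last field.
elapsed = (stop_time - start_time).total_seconds()
print(int(index), start_time, stop_time, float(duration), elapsed)
```

Comparing elapsed with the last column is a cheap sanity check that the line was split in the right places.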
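If the data is fixed width, you can slice the string by character position. The offsets below assume a single space between fields, with the line stored in s; adjust them to the real column widths:
n = s[0]
d1 = s[2:26]
d2 = s[27:51]
l = s[52:]
However, if Jan 02 is shown as Jan 2, this may not work, because the width of the string would then vary.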
