Remove duplicated punctuation in a string - python

I'm working on cleaning some text like the one below:
Great talking with you. ? See you, the other guys and Mr. Jack Daniels next week, I hope-- ? Bobette ? ? Bobette Riner??????????????????????????????? Senior Power Markets Analyst?????? TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell: 832/428-7008 bobette.riner#ipgdirect.com http://www.tradersnewspower.com ? ? - cinhrly020101.doc
It has multiple spaces and question marks. To clean it, I'm using regular expressions:
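import re

def remove_duplicate_characters(text):
    text = re.sub(r"\s+", " ", text)      # collapse runs of whitespace
    text = re.sub(r"\s*\?+", "?", text)   # drop whitespace before "?" and collapse runs of "?"
    text = re.sub(r"\s*\?+", "?", text)   # second pass merges "?" left adjacent by the first pass
    return text

remove_duplicate_characters(msg)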
Which gives me the following result:
'Great talking with you.? See you, the other guys and Mr. Jack Daniels next week, I hope--? Bobette? Bobette Riner? Senior Power Markets Analyst? TradersNews Energy 713/647-8690 FAX: 713/647-7552 cell: 832/428-7008 bobette.riner#ipgdirect.com http://www.tradersnewspower.com? - cinhrly020101.doc'
For this particular case it works, but it does not look like the best approach if I want to add more characters to remove. Is there an optimal way to solve this?

To replace all consecutive punctuation chars with their single occurrence you can use
re.sub(r"([^\w\s]|_)\1+", r"\1", text)
If the leading whitespace must be removed, use the r"\s*([^\w\s]|_)\1+" regex.
See the regex demo online.
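As a minimal sketch of both variants above, wrapped in one helper (the function name, the strip_leading_space flag and the sample strings are made up for illustration):

```python
import re

def collapse_repeated_punctuation(text, strip_leading_space=False):
    # ([^\w\s]|_) captures one punctuation character (or an underscore);
    # \1+ then requires one or more repeats of that same character.
    pattern = r"([^\w\s]|_)\1+"
    if strip_leading_space:
        # Optionally swallow whitespace in front of the repeated run.
        pattern = r"\s*" + pattern
    return re.sub(pattern, r"\1", text)

sample = "Riner??????? Senior Analyst?????? hope-- ok"
print(collapse_repeated_punctuation(sample))
print(collapse_repeated_punctuation("wait ... what", strip_leading_space=True))
```

Note that the generic pattern also shortens the // in http:// to a single slash, which is one reason to add exceptions.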
In case you want to introduce exceptions to this generic regex, you may add an alternative on the left that captures all the contexts where you want the consecutive punctuation to be kept:
re.sub(r'((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+', r'\1\2', text)
See this regex demo.
The ((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+ regex matches and captures a ... (not enclosed with other dots on either end) and a :// string (commonly seen in URLs); the rest is the original regex with the backreference adjusted (since there are now two capturing groups).
The \1\2 in the replacement pattern puts the captured values back into the resulting string.
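A short sketch of the exception-aware substitution, assuming Python 3.5 or later (where an unmatched group is substituted as an empty string); the constant name, helper name and sample sentence are illustrative only:

```python
import re

# Group 1 keeps "..." and "://" untouched; group 2 collapses any other repeated run.
KEEP_OR_COLLAPSE = re.compile(r"((?<!\.)\.{3}(?!\.)|://)|([^\w\s]|_)\2+")

def collapse_with_exceptions(text):
    # Only one of the two groups participates in each match,
    # so r"\1\2" re-inserts whichever one was captured.
    return KEEP_OR_COLLAPSE.sub(r"\1\2", text)

print(collapse_with_exceptions("Wait... see http://example.com now!!!"))
```

A run of four dots does not qualify as the protected ellipsis, so it is still collapsed to a single dot by the second alternative.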

Related

Hyphen character '-' creating issues when using regular expressions for BeautifulSoup

I am learning how to web scrape with Python using a Wikipedia article. I managed to get the data I needed, the tables, by using the .get_text() method on the table rows.
I am cleaning up the data in Pandas, and one of the routines involves getting the date a book or movie was published. There are many ways in which this can appear, such as:
(1986)
(1986-1989)
(1986-present)
Currently, I am using the code below which works on a test sentence:
# get the first columns of row 19 from the table and get its text
test = data_collector[19].find_all('td')[0]
text = test.get_text()
#create and test the pattern
pattern = re.compile('\(\d\d\d\d\)|\(\d\d\d\d-\d\d\d\d\)|\(\d\d\d\d-[ Ppresent]*\)')
re.findall(pattern, 'This is Agent (1857), the years were (1987-1868), which lasted from (1678- Present)')
I get the expected output on the test sentence.
['(1857)', '(1987-1868)', '(1678- Present)']
However, when I test it on a particular piece of text from the wiki article 'The Adventures of Sherlock Holmes (1891–1892) (series), (1892) (novel), Arthur Conan Doyle\n', I am able to extract (1892), but NOT (1891-1892).
text = test.get_text()
re.findall(pattern, text)
o/p: ['(1892)']
Even as I type this, I can see that the hyphen I am using and the one in the text are different. I am sure that this is the issue, and I was hoping someone could tell me what this particular symbol is called and how I can actually "type" it using my keyboard.
Thank you!
I suggest enhancing the pattern to search for the most common hyphens, -, – and —, and replacing the [ Ppresent]* character class with the literal char sequence present (the character class would also match a string like sent):
re.compile(r'\(\d{4}(?:[\s–—-]+(?:\d{4}|present))?\)', re.I)
See the regex demo. Note that re.I flag will make the regex match in a case insensitive way.
Details
\( - a (
\d{4} - four digits ({4} is a limiting quantifier that repeats the pattern it modifies four times)
(?:[\s–—-]+(?:\d{4}|present))? - an optional (as there is a ? at the end) non-capturing (due to ?:) group matching 1 or 0 occurrences of
[\s–—-]+ - 1 or more whitespaces, -, — or –
(?:\d{4}|present) - either 4 digits or present
\) - a ) char.
If you plan to match any hyphens use [\u002D\u058A\u05BE\u1400\u1806\u2010-\u2015\u2E17\u2E1A\u2E3A\u2E3B\u2E40\u301C\u3030\u30A0\uFE31\uFE32\uFE58\uFE63\uFF0D\s]+ instead of [\s–—-]+.
Or, to match any 1+ non-word chars at that location, probably, other than ( and ), use [^\w()]+ instead: re.compile(r'\(\d{4}(?:[^\w()]+(?:\d{4}|present))?\)', re.I).
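As a quick check, the suggested pattern can be run against the text from the question; this is only a sketch, and the names YEAR_SPAN and text are chosen here for illustration:

```python
import re

# Accepts a plain hyphen, an en dash or an em dash (and whitespace) between the parts.
YEAR_SPAN = re.compile(r'\(\d{4}(?:[\s–—-]+(?:\d{4}|present))?\)', re.I)

text = 'The Adventures of Sherlock Holmes (1891–1892) (series), (1892) (novel), Arthur Conan Doyle\n'
print(YEAR_SPAN.findall(text))

test_sentence = 'This is Agent (1857), the years were (1987-1868), which lasted from (1678- Present)'
print(YEAR_SPAN.findall(test_sentence))
```

Because the pattern contains no capturing groups, findall returns the full matches, including the parentheses.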

Python Regex for Clinical Trials Fields

I am trying to split text of clinical trials into a list of fields. Here is an example doc: https://obazuretest.blob.core.windows.net/stackoverflowquestion/NCT00000113.txt. Desired output is of the form: [[Date:<date>],[URL:<url>],[Org Study ID:<id>],...,[Keywords:<keywords>]]
I am using re.split(r"\n\n[^\s]", text) to split at paragraphs that start with a character other than space (to avoid splitting at the indented paragraphs within a field). This is all good, except the resulting fields are all (except the first field) missing their first character. Unfortunately, it is not possible to use string.partition with a regex.
I can add back the first characters by finding them using re.findall(r"\n\n[^\s]", text), but this requires a second iteration through the entire text (and seems clunky).
I am thinking it makes sense to use re.findall with some regex that matches all fields, but I am getting stuck. re.findall(r"[^\s].+\n\n") only matches the single line fields.
I'm not so experienced with regular expressions, so I apologize if the answer to this question is easily found elsewhere. Thanks for the help!
You may use a positive lookahead instead of a negated character class:
re.split(r"\n\n(?=\S)", text)
Now, it will only match 2 newlines if they are followed with a non-whitespace char.
Also, if there may be 2 or more newlines, you'd better use a {2,} limiting quantifier:
re.split(r"\n{2,}(?=\S)", text)
See the Python demo and a regex demo.
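A small sketch of the lookahead split on a made-up document (the field names and values in doc are hypothetical, not taken from the linked file):

```python
import re

# Hypothetical sample: the "Summary" field contains an indented paragraph
# that must stay attached to its field.
doc = (
    "Date: 2001\n\n"
    "URL: http://example.org\n\n"
    "Summary:\n\n  indented paragraph\n\n\n"
    "Keywords: a, b"
)

# Split on 2+ newlines only when the next character is not whitespace.
# The lookahead does not consume that character, so no field loses its first letter.
fields = re.split(r"\n{2,}(?=\S)", doc)
for field in fields:
    print(repr(field))
```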
You want a lookahead. You also might want it to be more flexible as far as how many newlines / what newline characters. You might try this:
import re
r = re.compile(r"""(\r\n|\r|\n)+(?=\S)""")
l = r.split(text)
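though this does seem to insert \r\n characters into the list, because re.split also returns the text matched by a capturing group. Writing the group as non-capturing, (?:\r\n|\r|\n)+, avoids that.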
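A minimal comparison of a capturing group and a non-capturing group, (?:...), in re.split; the sample string is made up for illustration:

```python
import re

text = "a\r\n\r\nb\n\nc"

# With a capturing group, re.split also returns the last text captured by the group.
with_group = re.split(r"(\r\n|\r|\n)+(?=\S)", text)

# With a non-capturing group, only the pieces between separators are returned.
without_group = re.split(r"(?:\r\n|\r|\n)+(?=\S)", text)

print(with_group)
print(without_group)
```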

Using regex, extract quoted strings that may contain nested quotes

I have the following string:
'Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.'
Now, I wish to extract the following quotes:
1. Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!
2. How Doth the Little Busy Bee,
3. I'll try again.
I tried the following code but I'm not getting what I want. The [^\1]* is not working as expected. Or is the problem elsewhere?
import re
s = "'Well, I've tried to say \"How Doth the Little Busy Bee,\" but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.'"
for i, m in enumerate(re.finditer(r'([\'"])(?!(?:ve|m|re|s|t|d|ll))(?=([^\1]*)\1)', s)):
    print("\nGroup {:d}: ".format(i+1))
    for g in m.groups():
        print(' '+g)
If you really need to return all the results from a single regular expression applied only once, it will be necessary to use lookahead ((?=findme)) so the finding position goes back to the start after each match - see this answer for a more detailed explanation.
To prevent false matches, some clauses are also needed regarding the quotes that add complexity, e.g. the apostrophe in I've shouldn't count as an opening or closing quote. There's no single clear-cut way of doing this but the rules I've gone for are:
An opening quote must not be immediately preceded by a word character (e.g. letter). So for example, A" would not count as an opening quote but ," would count.
A closing quote must not be immediately followed by a word character (e.g. letter). So for example, 'B would not count as a closing quote but '. would count.
Applying the above rules leads to the following regular expression:
(?=(?:(?<!\w)'(\w.*?)'(?!\w)|\"(\w.*?)\"(?!\w)))
Debuggex Demo
A good quick sanity check test on any possible candidate regular expression is to reverse the quotes. This has been done in this regex101 demo.
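As a rough sketch, the lookahead regex above can be applied with the standard re module to the string from the question; the names QUOTED and quotes are illustrative:

```python
import re

# The whole pattern sits inside a lookahead, so each match is zero-width and
# the scan can find a quote nested inside an earlier one.
QUOTED = re.compile(r"""(?=(?:(?<!\w)'(\w.*?)'(?!\w)|"(\w.*?)"(?!\w)))""")

s = """'Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.'"""

# Group 1 holds single-quoted content, group 2 double-quoted content.
quotes = [m.group(1) or m.group(2) for m in QUOTED.finditer(s)]
for number, quote in enumerate(quotes, start=1):
    print(number, quote)
```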
EDIT
I modified my regex; it now matches properly even in more complicated cases:
(?=(?<!\w|[!?.])('|\")(?!\s)(?P<content>(?:.(?!(?<=(?=\1).)(?!\w)))*)\1(?!\w))
DEMO
It is now even more complicated; the main improvements are that it does not match directly after certain punctuation characters ([!?.]) and that it separates the quote cases better. It was verified on a variety of examples.
The sentence ends up in the content capture group. It still has some restrictions, related to the use of whitespace and so on, but it should work with most properly formatted sentences, or at least it works with the examples.
(?=(?<!\w|[!?.])('|\")(?!\s) - matches a ' or " that is not preceded by a word or punctuation character ((?<!\w|[!?.])) and is not followed by whitespace ((?!\s)); the ' or " is captured in group 1 for later use,
(?P<content>(?:.(?!(?<=(?=\1).)(?!\w)))*)\1(?!\w)) - matches the sentence, followed by the same character (' or ", captured in group 1) that opened it, ignoring other quotes.
It doesn't match the whole sentence directly; the capturing group is nested inside a lookaround, so with the global match modifier it also matches sentences inside sentences, because the only thing matched directly is the position before the sentence starts.
About your regex:
I suppose that by [^\1]* you meant any character except the one captured in group 1, but a character class doesn't work this way: it treats \1 as a character in octal notation (the control character with code 1), not as a reference to the capturing group. Take a look at this example and read the explanation. Also compare the matching of THIS and THIS regex.
To achieve what you want, you should use lookaround, something like this: (')((?:.(?!\1))*.) - capture the opening char, then match every char which is not followed by captured opening char, then capture one more char, which is directly before captured char - and you have whole content between chars you excluded.
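Both points can be checked with a few lines of Python. This is only a sketch; the pattern is generalised here to accept either quote character, and the sample string is made up:

```python
import re

# Inside a character class, \1 is read as an octal escape for chr(1),
# not as a backreference to group 1.
octal_match = re.fullmatch(r"[\1]", "\x01")
print(octal_match is not None)

# Lookahead alternative: consume characters that are not followed by the
# opening quote, then take the one character that sits right before it.
between_quotes = re.compile(r"""(['"])((?:.(?!\1))*.)""")
m = between_quotes.search('say "hi there" now')
print(m.group(2))
```

Note that this pattern stops before the closing quote without consuming it.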
This is a great question for Python regex because sadly, in my opinion the re module is one of the most underpowered of mainstream regex engines. That's why for any serious regex work in Python, I turn to Matthew Barnett's stellar regex module, which incorporates some terrific features from Perl, PCRE and .NET.
The solution I'll show you can be adapted to work with re, but it is much more readable with regex because it is made modular. Also, consider it as a starting block for more complex nested matching, because regex lets you write recursive regular expressions similar to those found in Perl and PCRE.
Okay, enough talk, here's the code (a mere four lines apart from the import and definitions). Please don't let the long regex scare you: it is long because it is designed to be readable. Explanations follow.
The Code
import regex

quote = regex.compile(r'''(?x)
    (?(DEFINE)
        (?<qmark>["'])          # what we'll consider a quotation mark
        (?<not_qmark>[^'"]+)    # chunk without quotes
        (?<a_quote>(?P<qopen>(?&qmark))(?&not_qmark)(?P=qopen))  # a non-nested quote
    )                           # End DEFINE block

    # Start Match block
    (?&a_quote)
    |
    (?P<open>(?&qmark))
    (?&not_qmark)?
    (?P<quote>(?&a_quote))
    (?&not_qmark)?
    (?P=open)
''')

text = """'Well, I have tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I will try again.'"""

for match in quote.finditer(text):
    print(match.group())
    if match.group('quote'):
        print(match.group('quote'))
The Output
'Well, I have tried to say "How Doth the Little Busy Bee," but it all came different!'
"How Doth the Little Busy Bee,"
'I will try again.'
How it Works
First, to simplify, note that I have taken the liberty of converting I'll to I will, reducing confusion with quotes. Addressing I'll would be no problem with a negative lookahead, but I wanted to make the regex readable.
In the (?(DEFINE)...) block, we define the three sub-expressions qmark, not_qmark and a_quote, much in the way that you define variables or subroutines to avoid repeating yourself.
After the definition block, we proceed to matching:
(?&a_quote) matches an entire quote,
| or...
(?P<open>(?&qmark)) matches a quotation mark and captures it to the open group,
(?&not_qmark)? matches optional text that is not quotes,
(?P<quote>(?&a_quote)) matches a full quote and captures it to the quote group,
(?&not_qmark)? matches optional text that is not quotes,
(?P=open) matches the same quotation mark that was captured at the opening of the quote.
The Python code then only needs to print the match and the quote capture group if present.
Can this be refined? You bet. Working with (?(DEFINE)...) in this way, you can build beautiful patterns that you can later re-read and understand.
Adding Recursion
If you want to handle more complex nesting using pure regex, you'll need to turn to recursion.
To add recursion, all you need to do is define a group and refer to it using the subroutine syntax. For instance, to execute the code within Group 1, use (?1). To execute the code within group something, use (?&something). Remember to leave an exit for the engine by either making the recursion optional (?) or one side of an alternation.
References
Pre-defined regex subroutines
Named capture groups
It seems difficult to achieve with just one regex pass, but it can be done with a relatively simple regex and a recursive function:
import re
REGEX = re.compile(r"(['\"])(.*?[!.,])\1", re.S)
S = """'Well, I've tried to say "How Doth the Little Busy Bee," but it all came different!' Alice replied in a very melancholy voice. She continued, 'I'll try again.' 'And we may now add "some more 'random test text'.":' "Yes it seems to be a good idea!" 'ok, let's go.'"""
def extract_quotes(string, quotes_list=None):
    quotes = quotes_list or []
    quotes += [found[1] for found in REGEX.findall(string)]
    index = 0
    for quote in quotes[:]:
        index += 1
        # Recurse into each quote and insert any inner quotes right after it
        sub_list = extract_quotes(quote)
        quotes = quotes[:index] + sub_list + quotes[index:]
        index += len(sub_list)
    return quotes

print(extract_quotes(S))
This prints:
['Well, I\'ve tried to say "How Doth the Little Busy Bee," but it all came different!', 'How Doth the Little Busy Bee,', "I'll try again.", 'And we may now add "some more \'random test text\'.":\' "Yes it seems to be a good idea!" \'ok, let\'s go.', "some more 'random test text'.", 'Yes it seems to be a good idea!']
Note that the regex uses punctuation to determine whether a quoted text is a "real quote": in order to be extracted, a quote needs to end with a punctuation character before the closing quote. That is, 'random test text' is not considered an actual quote, while 'ok, let's go.' is.
The regex is pretty simple; I think it does not need explanation.
The extract_quotes function finds all quotes in the given string and stores them in a list. Then it calls itself for each quote found, looking for inner quotes...

How do I append a list of negative lookbehinds to a python regular expression?

I'm trying to split a paragraph into sentences using regex split and I'm trying to use the second answer posted here:
a Regex for extracting sentence from a paragraph in python
But I have a list of abbreviations that I don't want to end the sentence on even though there's a period. But I don't know how to append it to that regular expression properly. I'm reading in the abbreviations from a file that contains terms like Mr. Ms. Dr. St. (one on each line).
Short answer: You can't, unless all lookbehind assertions are of the same, fixed width (which they probably aren't in your case; your example contained only two-letter abbreviations, but Mrs. would break your regex).
This is a limitation of the current Python regex engine.
Longer answer:
You could write a regex like (?s)(?<!.Mr|Mrs|.Ms|.St)\., padding each alternating part of the lookbehind assertion with as many .s as needed to get all of them to the same width. However, that would fail in some circumstances, for example when a paragraph begins with Mr..
Anyway, you're not using the right tool here. Better use a tool designed for the job, for example the Natural Language Toolkit.
If you're stuck with regex (too bad!), then you could try and use a findall() approach instead of split():
(?:(?:\b(?:Mr|Ms|Dr|Mrs|St)\.)|[^.])+\.\s*
would match a sentence that ends in . (optionally followed by whitespace) and may contain no dots unless preceded by one of the allowed abbreviations.
>>> import re
>>> s = "My name is Mr. T. I pity the fool who's not on the A-Team."
>>> re.findall(r"(?:(?:\b(?:Mr|Ms|Dr|Mrs|St)\.)|[^.])+\.\s*", s)
['My name is Mr. T. ', "I pity the fool who's not on the A-Team."]
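Since the question reads the abbreviations from a file, the same findall pattern can be assembled from a list. The sketch below assumes each entry looks like "Mr." with a trailing period; the function name and the list contents are illustrative:

```python
import re

def build_sentence_pattern(abbreviations):
    # Drop the trailing period of each entry and escape it so it matches literally.
    alternation = "|".join(re.escape(a.rstrip(".")) for a in abbreviations)
    # Same shape as the pattern above: an allowed abbreviation followed by a dot,
    # or any non-dot character, repeated, then the sentence-ending dot.
    return re.compile(rf"(?:(?:\b(?:{alternation})\.)|[^.])+\.\s*")

abbreviations = ["Mr.", "Ms.", "Dr.", "Mrs.", "St."]  # e.g. one per line in the file
sentence_re = build_sentence_pattern(abbreviations)

s = "My name is Mr. T. I pity the fool who's not on the A-Team."
print(sentence_re.findall(s))
```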
I don't directly answer your question, but this post should contain enough information for you to write a working regex for your problem.
You can append a list of negative look-behinds. Remember that look-behinds are zero-width, which means that you can put as many look-behinds as you want next to each other, and they all look behind from the same position. As long as you don't need an unbounded quantifier (e.g. *, +, {n,}) inside a look-behind, everything should be fine (?).
So the regex can be constructed like this:
(?<!list )(?<!of )(?<!words )(?<!not )(?<!allowed )(?<!to )(?<!precede )pattern\w+
It is a bit too verbose. Anyway, I wrote this post just to demonstrate that it is possible to look behind on a list of fixed strings.
Example run:
>>> s = 'something patterning of patterned crap patternon not patterner, not allowed patternes to patternsses, patternet'
>>> re.findall(r'(?<!list )(?<!of )(?<!words )(?<!not )(?<!allowed )(?<!to )(?<!precede )pattern\w+', s)
['patterning', 'patternon', 'patternet']
There is a catch in using look-behinds, though. If there is a variable number of spaces between the blacklisted text and the text matching the pattern, the regex above will fail. I really doubt there is a way to modify the regex so that it handles that case while keeping the look-behinds. (You can always collapse consecutive spaces into one first, but that won't work for more general cases.)
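The chain of look-behinds can also be generated from a list instead of being typed by hand. A minimal sketch, with a hypothetical helper name, could look like this:

```python
import re

def build_lookbehind_pattern(blocked, core):
    # One fixed-width negative look-behind per blocked string; because they are
    # zero-width, all of them are checked at the same position.
    prefix = "".join("(?<!%s)" % re.escape(b) for b in blocked)
    return re.compile(prefix + core)

blocked = ["list ", "of ", "words ", "not ", "allowed ", "to ", "precede "]
pattern = build_lookbehind_pattern(blocked, r"pattern\w+")

s = 'something patterning of patterned crap patternon not patterner, not allowed patternes to patternsses, patternet'
print(pattern.findall(s))

# The blocked strings may have different lengths, e.g. abbreviations before a period.
splitter = build_lookbehind_pattern(["Mr", "Mrs", "Dr"], r"\.\s+")
print(splitter.split("Mr. T is here. Mrs. Smith left."))
```

Each look-behind is fixed-width on its own, so entries of different lengths can be mixed, which a single alternation inside one look-behind does not allow in re.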

How to extract with excluding some characters by python regex

I have been using python regex to extract address patterns.
For example, I have a list of addresses as below:
12buixuongtrach
34btrannhatduat
25bachmai
78bhoangquocviet
I want to refine the addresses like these:
12 buixuongtrach
34b trannhatduat
25 bachmai
78b hoangquocviet
Could anyone please help with some hint code?
Many thanks
You can use a pretty simple regex to split the numbers off from the letters, but like people have said in the comments, there's no way to know when those b's should be part of the number and when they're part of the text.
import re
text = """12buixuongtrach
34btrannhatduat
25bachmai
78bhoangquocviet"""
unmatched = text.split()
matched = [re.sub(r'(\d+)(.*)', r'\1 \2', s) for s in unmatched]
Which gives:
>>> matched
['12 buixuongtrach', '34 btrannhatduat', '25 bachmai', '78 bhoangquocviet']
The regex is just grabbing one or more digits at the start of the string and putting them into group \1, then putting the rest of the string into group \2.
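If a list of valid street names happens to be available, the ambiguity about the b can be resolved with a lookup. The sketch below is built entirely on that assumption: KNOWN_STREETS and split_address are hypothetical and not part of the original answer.

```python
import re

# Hypothetical lookup table; in practice this would come from real address data.
KNOWN_STREETS = {"buixuongtrach", "trannhatduat", "bachmai", "hoangquocviet"}

def split_address(raw, streets=KNOWN_STREETS):
    m = re.fullmatch(r"(\d+)(.*)", raw)
    if m is None:
        return raw  # no leading house number, leave unchanged
    number, rest = m.groups()
    # If the remainder is not a known street but becomes one after dropping its
    # first letter, treat that letter as a suffix of the house number.
    if rest not in streets and rest[1:] in streets:
        return f"{number}{rest[0]} {rest[1:]}"
    return f"{number} {rest}".strip()

addresses = ["12buixuongtrach", "34btrannhatduat", "25bachmai", "78bhoangquocviet"]
print([split_address(a) for a in addresses])
```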
Thanks, all, for your responses. I finally found a workaround.
I used the pattern below and it works like a charm :)
'[a-zA-Z]+|[\/0-9abcd]+(?!a|u|c|h|o|e)'
