How to regex to match sentences in a pdf? - python

I have the following pdf file I use PyPDF2 to extract text from it pdf image
and I'm looking for a regex to capture numbered sentences in the pdf file
I tried a couple of regex in the following code but the output is not as needed I need to capture the numbered points each as one sentence like this
expected OUTPUT
['1. Please admit that Plaintiff, JOSHUA PINK, received benefits from a collateral
source, as defined by §768.76, Florida Statutes, for medical bills alleged to have been incurred as
a result of the incident described in the Complaint.',2. please.....]
Instead of two regexes I tried either doesn't capture the full sentence or capture it in multiple lines and consider every \n as a new sentence
Extracted TEXT
" \n IN THE CIRCUIT COURT, OF THE \nEIGHTEENTH JUDICIAL CIRCUIT, IN \nAND FOR SEMINOLE COUNTY, \nFLORIDA \n \nCASE NO: 2022 -CA-002235 \n \nJOSHUA PINK, \n \n Plaintiff, \nvs. \n \nMATHEW ZUMBRUM , \n \n Defendant. \n / \n \nDEFENDANT'S REQUEST FOR ADMISSIONS TO PLAINTIFF, JOSHUA PINK \n \n \nCOME NOW the Defendant , MATHEW ZUMBRUM , by and through the undersigned \nattorneys, and pursuant to Rule 1.370, Florida Rul es of Civil Procedure, requests the Plaintiff, \nJOSHUA PINK, admit in this action that each of the following statements are true: \n1. Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral \nsource, as defined by §768.76, Florida Statute s, for medical bills alleged to have been incurred as \na result of the incident described in the Complaint. \n2. Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral \nsource, as defined by §768.76, Florida Statutes, for loss of wages o r income alleged to have been \nsustained as a result of the incident described in the Complaint. \n3. Please admit that Plaintiff, JOSHUA PINK , received benefits under the Personal \nInjury Protection portion of an automobile policy for medical bills alleged to have been incurred \nas a result of the incident described in the Complaint. \n Filing # 162442429 E-Filed 12/06/2022 09:46:49 AM\n \n2 4. Please admit that Plaintiff, JOSHUA PINK , received benefits under the Personal \nInjury Protection portion of an automobile insurance policy for loss of wages or income alleged \nto have been sustained as a result of the incident described in the Complaint. \n5. Please admit that Plaintiff, JOSHUA PINK , received benefits under the medical \npayments provisions of an automobile insurance policy for medical bills alleged to have been \nincurred as a result of the incident described in the Complaint. \n6. Please admit that Plaintiff, JOSHUA PINK , is subject to a deductible under the \nPersonal Injury Protection portion of an automobile insurance policy. \n7. Please admit that Plaintiff, JOSHUA PINK received benefits pursuant to personal \nor group health insurance policy, for medical bills alleged to have been incurred as a result of the \nincident described in the Complaint. \n8. Please admit that Plaintiff, JOSHUA PINK , received benefits pursuant to a \npersonal or group wage continuation plan or policy, for loss of wages or income alleged to have \nbeen sustained as a result of the incident described in the Complaint. \n 9. Please admit that on the date of the accident alleged in your Complaint, Defendant, \nMATHEW ZUMBRUM , complied with and met the security requirements under Chapter \n627.730 - 627.7405, Florida Statutes. \n10. Please admit that Plaintiff, JOSHUA PINK , was partially responsible for the \nsubject accident. \n11. Please admit that Plaintiff, JOSHUA PINK , did NOT suffer a permanent injury as \na result of the subject accident. \nI HEREBY CERTIFY that on the 6th day of December, 2022 a true and correct copy of \nthe foregoing was electronically filed with the Florida Court s E-Filing Portal system which will \n \n3 send a notice of electronic filing to Michael R. Vaughn, Esq., Morgan & Morgan, P.A., 20 N. \nOrange Ave, 16th Floor, Orlando, FL 32801 at mvaughn#forthepeople.com; \njburnham#forthepeople.com; mserrano#forthepeople.com. \nAND REW J. GORMAN & ASSOCIATES \n \nBY: \n \n(Original signed electronically by Attorney.) \nLOURDES CALVO -PAQUETTE, ESQ. \nAttorney for Defendant, Zumbrum \n390 N. Orange Avenue, Suite 1700 \nOrlando, FL 32801 \nTelephone: (407) 872 -2498 \nFacsímile: (855) 369 -8989 \nFlorida Bar No. 0817295 \nE-mail for service (FL R. Jud. Admin. 2.516) : \nflor.law -mlslaw.172o19#statefarm.com \n \nAttorneys and Staff of Andrew J. Gorman & \nAssociates are Employees of the Law Department \nof State Farm Mutual Automobile Insurance \nCompany. \n \n \n\n"
sample output of regex2 (sentence is captured in 2 lines)
[('2022', 'CA-002235 '),
('1', 'Florida Rul es of Civil Procedure, requests the Plaintiff,'),
('1',
'Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral'),
('768',
'Florida Statute s, for medical bills alleged to have been incurred as'),...]
sample output of regex1 (not capturing full sentence)
['1. Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral ',
'2. Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral ',
'3. Please admit that Plaintiff, JOSHUA PINK , received benefits under the Personal ',
'2 4. Please admit that Plaintiff, JOSHUA PINK , received benefits under the Personal ',
'5. Please admit that Plaintiff, JOSHUA PINK , received benefits under the medical ',....]
code:
def read_pdf(name):
reader = PdfReader(name,"rb")
text = ""
for page in reader.pages:
text += page.extract_text() + "\n"
#regex1 = r'(^[0-9].*)'
regex2 = r'([\d]+).+?([a-zA-Z].+).'
pat = re.compile(regex, re.M)
extracted_text = pat.findall(text)
return text,extracted_text
text,pdf1 = read_pdf(names[0])

I'll provide an answer to go over a couple of different patterns you can use to approach text items like that. Let's say you have a text that is structured like this:
test_str = """
Some preamble.
1. Very
long
sentence.
2. One-line sentence.
3. Another
longer sentence.
A new paragraph.
"""
First scenario: you want to match items that begin with a number followed by a period at the beginning of a line (with optional leading space) and end with a period at the end of a line - irrespective of how many characters it takes, but as few as possible. That's what your question reads like. One pattern that describes this is ^[ \t]*\d+\.[\s\S]*?\.$. The heavy lifting here is done by [\s\S]*? which is a lazy class that just matches any character (by including all spaces and all non-spaces) as few times as possible.
regex1 = re.compile(r"^[ \t]*\d+\.[\s\S]*?\.$", re.MULTILINE)
print(re.findall(regex1, test_str))
Which returns:
[' 1. Very\nlong\nsentence.', ' 2. One-line sentence.', ' 3. Another\nlonger sentence.']
If you want to exclude leading space, you could add a capturing group ^[ \t]*(\d+\.[\s\S]*?\.)$ in which case findall() will only return the captured part. In Python:
regex2 = re.compile(r"^[ \t]*(\d+\.[\s\S]*?\.)$", re.MULTILINE)
print(re.findall(regex2, test_str))
Which returns:
['1. Very\nlong\nsentence.', '2. One-line sentence.', '3. Another\nlonger sentence.']
First scenario, alternative expression: after the leading number, express the match in terms of lines; always get the first line and add every following line as long as the preceding line does not end in a period: ^[ \t]*(\d+\..*(?:[^.]$\r?\n.*)*\.)$. This will be faster than the lazy class in the first example and returns the same as with regex2.
regex3 = re.compile(r"^[ \t]*(\d+\..*(?:[^.]$\r?\n.*)*\.)$", re.MULTILINE)
print(re.findall(regex3, test_str))
Second scenario: we don't care what the sentence(s) end in. Just get complete items, which we'll interpret as the leading number followed by all lines that do not start with another leading number or an entirely new paragraph: ^[ \t]*(\d+\..+$(?:\r?\n(?![ \t]*\d+\.|A new).*)*).
This makes use of a negative lookahead (?![ \t]*\d+\.|A new) to prevent matching lines that start either with a new item number or some non-item text and allows more control over what kind of lines may constitute an item. Return values are the same.
regex4 = re.compile(r"^[ \t]*(\d+\..+$(?:\r?\n(?![ \t]*\d+\.|A new).*)*)", re.MULTILINE)
print(re.findall(regex4, test_str))

If you want to match sentences followed by a dot, you might use:
\b\d+\.[^\S\n][^.]*(?:\.(?=\S)[^.]*)*\.
Explanation
\b A word boundary to prevent a partial word match
\d+\.[^\S\n] Match 1+ digits, a dot and a space
[^.]*(?:\.(?=\S)[^.]*)* Optionally match any character except for dots, and then only match the dot when there is a non whitespace character following.
\. Match a dot
See a regex demo.
A pattern with more punctuation characters:
\b\d+\.[^\S\n][^.!?]*(?:[.!?](?=\S)[^.!?]*)*[.!?]
See another regex demo.

Try this:
(\d+\.\s)(.|\n)*?(?=\d+\.\s|\z|\.\s)
This will match from any number followed by a period and a space to the end of the sentence (period followed by a space) or until the next number followed by a period and a space or the end of the string.
See example here

Recommend using Punkt Sentence Tokenizer or any other NLP package of your choice as writing a general purpose regex to detect sentence can be very tricky unless you have only a very strictly defined pattern with limited scope! For example, if you take only numbered sentences then the following regex might work: "\d\.(.)+[a-z]\."gmi

Related

Tokenize paragraphs by special characters; then rejoin so tokenized segments to reach certain length

I have this long paragraph:
paragraph = "The weakening of the papacy by the Avignon exile and the Papal Schism; the breakdown of monastic discipline and clerical celibacy; the luxury of prelates, the corruption of the Curia, the worldly activities of the popes; the morals of Alexander VI, the wars of Julius II, the careless gaiety of Leo X; the relicmongering and peddling of indulgences; the triumph of Islam over Christendom in the Crusades and the Turkish wars; the spreading acquaintance with non-Christian faiths; the influx of Arabic science and philosophy; the collapse of Scholasticism in the irrationalism of Scotus and the skepticism of Ockham; the failure of the conciliar movement to effect reform; the discovery of pagan antiquity and of America; the invention of printing; the extension of literacy and education; the translation and reading of the Bible; the newly realized contrast between the poverty and simplicity of the Apostles and the ceremonious opulence of the Church; the rising wealth and economic independence of Germany and England; the growth of a middle class resentful of ecclesiastical restrictions and claims; the protests against the flow of money to Rome; the secularization of law and government; the intensification of nationalism and the strengthening of monarchies; the nationalistic influence of vernacular languages and literatures; the fermenting legacies of the Waldenses, Wyclif, and Huss; the mystic demand for a less ritualistic, more personal and inward and direct religion: all these were now uniting in a torrent of forces that would crack the crust of medieval custom, loosen all standards and bonds, shatter Europe into nations and sects, sweep away more and more of the supports and comforts of traditional beliefs, and perhaps mark the beginning of the end for the dominance of Christianity in the mental life of European man."
My goal is to split this long paragraph into multiple sentences keeping the sentences around 18 - 30 words each.
There is only one full-stop at the end; so nltk tokenizer is of no use. I can use regex to tokenize; I have this pattern that works in splitting:
regex_special_chars = '([″;*"(§=!‡…†\\?\\]‘)¿♥[]+)'
new_text = re.split(regex_special_chars, paragraph)
The question is how to join this paragraph into a list of multiple sentences that would be around 18 to 30; where possible; because sometimes it's not possible with this regex.
The end result will look like the following list below:
tokenized_paragraph = ['The weakening of the papacy by the Avignon exile and the Papal Schism; the breakdown of monastic discipline and clerical celibacy;',
'the luxury of prelates, the corruption of the Curia, the worldly activities of the popes; the morals of Alexander VI, the wars of Julius II, the careless gaiety of Leo X;',
'the relicmongering and peddling of indulgences; the triumph of Islam over Christendom in the Crusades and the Turkish wars; the spreading acquaintance with non-Christian faiths; ',
'the influx of Arabic science and philosophy; the collapse of Scholasticism in the irrationalism of Scotus and the skepticism of Ockham; the failure of the conciliar movement to effect reform; ',
'the discovery of pagan antiquity and of America; the invention of printing; the extension of literacy and education; the translation and reading of the Bible; ',
'the newly realized contrast between the poverty and simplicity of the Apostles and the ceremonious opulence of the Church; the rising wealth and economic independence of Germany and England;',
'the growth of a middle class resentful of ecclesiastical restrictions and claims; the protests against the flow of money to Rome; the secularization of law and government; ',
'the intensification of nationalism and the strengthening of monarchies; the nationalistic influence of vernacular languages and literatures; the fermenting legacies of the Waldenses, Wyclif, and Huss;',
'the mystic demand for a less ritualistic, more personal and inward and direct religion: all these were now uniting in a torrent of forces that would crack the crust of medieval custom, loosen all standards and bonds, shatter Europe into nations and sects, sweep away more and more of the supports and comforts of traditional beliefs, and perhaps mark the beginning of the end for the dominance of Christianity in the mental life of European man.']
if we check the lengths of the end result; we get this many words into each tokenized segment:
[len(sent.split()) for sent in tokenized_paragraph]
[21, 31, 25, 30, 25, 29, 27, 26, 76]
Only the last segment exceeded 30 words (76 words), and that's okay!
Edit
The regex could include a colon : So the last segment would be less than 76
I would suggest using findall instead of split.
Then the regex could be:
(?:\S+\s+)*?(?:\S+\s+){17,29}\S+(?:$|[″;*"(§=!‡…†\?\]‘)¿♥[]+)
Break-down:
\S+\s+ a word and the space(s) that follow it
(?:\S+\s+)*?(?:\S+\s+){17,29}: lazily match some words followed by a space (so initially it wont match any) and then greedily match as many words as possible up to 29, but at least 17, and all that ending with white space. The first lazy match is needed for when no match completes with just the greedy part.
\S+(?:$|[″;*"(§=!‡…†\?\]‘)¿♥[]+): match one more word, terminated by a terminator character, or the end of the string.
So:
regex = r'(?:\S+\s+)*?(?:\S+\s+){18,30}\S+(?:$|[″;*"(§=!‡…†\?\]‘)¿♥[]+)'
new_text = re.findall(regex, paragraph)
for line in new_text:
print(len(line.split()), line)
The number of words per paragraph are:
[21, 31, 25, 30, 25, 29, 27, 26, 76]

Regex: Match words at end of line but do not include X

I am trying to get the span of the city name from some addresses, however I am struggling with the required regex. Examples of the address format is below.
flat 1, tower block, 34 long road, Major city
flat 1, tower block, 34 long road, town and parking space
34 short road, village on the river and carpark (7X3 8RG)
The expected text to be captured in each case is "Major city", "town" and "village on the river". The issue is that sometimes "and parking space" or a variant is included in the address. Using a regex such as "(?<=,\s)\w+" would return "town and parking space" in the case of example 2.
The city is always after the last comma of the address.
I have tried to re-work this question but have not successfuly managed to exclude the "and parking space" section.
I have already created a regex that excludes the postcodes this is just included as an answer would ideally allow for that part of the regex to be bolted on the end.
How would I create a regex that starts after the last comma and runs to the end of the address but stops at any "and parking" or postcodes?
You can capture these strings using
,\s*((?:(?!\sand\s)[^,])*)(?=[^,]*$)
,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)
.*,\s*((?:(?!\sand\s)[^,])*)
.*,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)
See this regex demo or this regex demo.
Details:
, - a comma ]
\s* - zero or more whitespaces
((?:(?!\sand\s)[^,])*) - Group 1: any char other than a comma, zero or more occurrences, that does not start whitespace + and + whitespace char sequence
(?=[^,]*$) - there must be any zero or more chars other than a comma till end of string.
In Python, you would use
m = re.search(r'.*,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)', text)
if m:
print(m.group(1))
See the demo:
import re
texts = ['flat 1, tower block, 34 long road, Major city',
'flat 1, tower block, 34 long road, town and parking space',
'34 short road, village on the river and carpark (7X3 8RG)']
rx = re.compile(r'.*,\s*([^,]*?)(?=(?:\sand\s[^,]*)?$)')
for text in texts:
m = re.search(rx, text)
if m:
print(m.group(1))
Output:
Major city
town
village on the river
I would do:
import re
exp = ['flat 1, tower block, 34 long road, Major city',
'flat 1, tower block, 34 long road, town and parking space',
'34 short road, village on the river and carpark (7X3 8RG)']
for e in (re.split(',\s*', x)[-1] for x in exp):
print(re.sub(r'(?:\s+and car.*)|(?:\s+and parking.*)','',e))
Prints:
Major city
town
village on the river
Works like this:
Split the string on ,\s* and take the last portion;
Remove anything from the end of that string that starts with the specified (?:\s+and car.*)|(?:\s+and parking.*)
You can easily add addition clauses to remove with this approach.

REGEX: Remove spaces between strings with one or two letters

Consider the following original strings showed in the first columns of the following table:
Original String Parsed String Desired String
'W. & J. JOHNSON LMT.COM' #W J JOHNSON LIMITED #WJ JOHNSON LIMITED
'NORTH ROOF & WORKS CO. LTD.' #NORTH ROOF WORKS CO LTD #NORTH ROOF WORKS CO LTD
'DAVID DOE & CO., LIMITED' #DAVID DOE CO LIMITED #DAVID DOE CO LIMITED
'GEORGE TV & APPLIANCE LTD.' #GEORGE TV APPLIANCE LTD #GEORGE TV APPLIANCE LTD
'LOVE BROS. & OTHERS LTD.' #LOVE BROS OTHERS LTD #LOVE BROS OTHERS LTD
'A. B. & MICHAEL CLEAN CO. LTD.'#A B MICHAEL CLEAN CO LTD #AB MICHAEL CLEAN CO LTD
'C.M. & B.B. CLEANER INC.' #C M B B CLEANER INC #CMBB CLEANER INC
Punctuation needs to be removed which I have done as follows:
def transform(word):
word = re.sub(r'(?<=[A-Za-z])\'(?=[A-Za-z])[A-Z]|[^\w\s]|(.com|COM)',' ',word)
However, there is one last point which I have not been able to get. After removing punctuations I ended up with lots of spaces. How can I have a regular expression that put together initials and keep single spaces for regular words (no initials)?
Is this a bad approach to substitute the mentioned characters to get the desired strings?
Thanks for allowing me to continue learning :)
I think it's simpler to do this in parts. First, remove .com and any punctuation other than space or &. Then, remove a space or & surrounded by only one letter. Finally, replace any remaining sequence of space or & with a single space:
import re
strings = ['W. & J. JOHNSON LMT.COM',
'NORTH ROOF & WORKS CO. LTD.',
'DAVID DOE & CO., LIMITED',
'GEORGE TV & APPLIANCE LTD.',
'LOVE BROS. & OTHERS LTD.',
'A. B. & MICHAEL CLEAN CO. LTD.',
'C.M. & B.B. CLEANER INC.'
]
for s in strings:
s = re.sub(r'\.COM|[^a-zA-Z& ]+', '', s, 0, re.IGNORECASE)
s = re.sub(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', s)
s = re.sub(r'\s*[& ]\s*', ' ', s)
print s
Output
WJ JOHNSON LMT
NORTH ROOF WORKS CO LTD
DAVID DOE CO LIMITED
GEORGE TV APPLIANCE LTD
LOVE BROS OTHERS LTD
AB MICHAEL CLEAN CO LTD
CM BB CLEANER INC
Demo on rextester
Update
This was written before the edit to the question changing the required result for the last data. Given the edit, the above code can be simplified to
for s in strings:
s = re.sub(r'\.COM|[^a-zA-Z ]+|\s(?=&)|(?<!\w\w)\s+(?!\w\w)', '', s, 0, re.IGNORECASE)
print s
Demo on rextester
Doing this in regex alone won't be pretty and is not the best solution, yet, here it is! You're better off doing a multiple step approach. What I've done is identified all the cases that are possible and opted to find a solution where there's no replacement string since you're not always replacing the characters with spaces.
Rules
Non "Stacked" Abbreviations
These are locations like A. B. or W. & J., but not C.M. & B.B.
I've identified these as locations where an abbreviation part (e.g. A.) exists before and after, but the latter is not followed by another alpha character
Preceding Space
These locations don't exist in your text but could if a space preceded a non-alpha character without a space following it (say at the end of a line)
We match the characters after the first space in these cases
Proceeding Space
These are locations like & and the dot in J.
We match the character before the last space in those examples
No Spaces
These are locations like 'LOVE (the apostrophe in that string)
We only match the non-alpha-non-whitespace characters
Regex
An all-in-one regex that accomplishes this is as follows:
See regex in use here
(?<=\b[a-z])[^a-z]+(?=[a-z]\b(?![^a-z][a-z]))|(?<= ) *(?:\.com\b|[^a-z\s]+) *| *(?:\.com\b|[^a-z\s]+) *(?= )|(?<! )(?:\.com\b|[^a-z\s]+)(?! )
Works as follows (broken into each alternation):
(?<=\b[a-z])[^a-z]+(?=[a-z]\b(?![^a-z][a-z])) matches non-alpha characters between A. and B. but not A. and B.B
(?<=\b[a-z]) positive lookbehind ensuring what precedes is an alpha character and assert a word boundary position to its left
[^a-z]+ match any non-alpha character one or more times
(?=[a-z]\b(?![^a-z][a-z])) positive lookahead ensuring the following exists
[a-z]\b match any alpha character and assert a word boundary position to its right
(?![^a-z][a-z]) negative lookahead ensuring what follows is not a non-alpha character followed by an alpha character
(?<= ) *(?:\.com\b|[^a-z\s]+) * ensures a space precedes, then matches any spaces, .com or any non-word-non-whitespace characters one or more times, then any spaces
(?<= ) positive lookbehind ensuring a space precedes
* match any number of spaces
(?:\.com\b|[^a-z\s]+) match .com and ensure a non-word character follows, or match any non-word-non-whitespace character one or more times
* match any number of spaces
*(?:\.com\b|[^a-z\s]+) *(?= ) matches any spaces, .com or any non-word-non-whitespace characters one or more times, then any spaces, then ensures a space follows
Same as previous but instead of the positive lookbehind at the beginning, there's a positive lookahead at the end
(?<! )(?:\.com\b|[^a-z\s]+)(?! ) matches .com or any non-alpha-non-whitespace characters one or more times ensuring no spaces surround it
Same as previous two options but uses negative lookbehind and negative lookahead
Code
See code in use here
import re
strings = [
"'W. & J. JOHNSON LMT.COM'",
"'NORTH ROOF & WORKS CO. LTD.'",
"'DAVID DOE & CO., LIMITED'",
"'GEORGE TV & APPLIANCE LTD.'",
"'LOVE BROS. & OTHERS LTD.'",
"'A. B. & MICHAEL CLEAN CO. LTD.'",
"'C.M. & B.B. CLEANER INC.'"
]
r = re.compile(r'(?<=\b[a-z])[^a-z]+(?=[a-z]\b(?![^a-z][a-z]))|(?<= ) *(?:\.com\b|[^a-z\s]+) *| *(?:\.com\b|[^a-z\s]+) *(?= )|(?<! )(?:\.com\b|[^a-z\s]+)(?! )', re.IGNORECASE)
def transform(word):
return re.sub(r, '', word)
for s in strings:
print(transform(s))
Outputs:
WJ JOHNSON LMT
NORTH ROOF WORKS CO LTD
DAVID DOE CO LIMITED
GEORGE TV APPLIANCE LTD
LOVE BROS OTHERS LTD
AB MICHAEL CLEAN CO LTD
CM BB CLEANER INC
Edit
Using a callback, you can extend this logic to include special cases as mentioned in a comment below my answer to match specific cases and have conditional replacements.
These special cases include:
FONTAINE'S to FONTAINE
PREMIUM-FIT AUTO to PREMIUM FIT AUTO
62325 W.C. to 62325 WC
I added a new alternation to the regex: (\b[\'-]\b(?:[a-z\d] )?) to capture 'S or - between letters (also -S or similar) and replace it with a space using the callback (if the capture group exists).
I still suggest using multiple regular expressions to accomplish this, but I wanted to show that it is possible with a single pattern.
See code in use here
import re
strings = [
"'W. & J. JOHNSON LMT.COM'",
"'NORTH ROOF & WORKS CO. LTD.'",
"'DAVID DOE & CO., LIMITED'",
"'GEORGE TV & APPLIANCE LTD.'",
"'LOVE BROS. & OTHERS LTD.'",
"'A. B. & MICHAEL CLEAN CO. LTD.'",
"'C.M. & B.B. CLEANER INC.'",
"'FONTAINE'S PREMIUM-FIT AUTO 62325 W.C.'"
]
r = re.compile(r'(?<=\b[a-z\d])[^a-z\d]+(?=[a-z\d]\b(?![^a-z\d][a-z\d]))|(?<= ) *(?:\.com\b|[^a-z\d\s]+) *| *(?:\.com\b|[^a-z\d\s]+) *(?= )|(\b[\'-]\b(?:[a-z\d] )?)|(?<! )(?:\.com\b|[^a-z\d\s]+)(?! )', re.IGNORECASE)
def repl(m):
return ' ' if m.group(1) else ''
for s in strings:
print(r.sub(repl, s))
Here's the simplest I could get it with one regex pattern:
\.COM|(?<![A-Z]{2}) (?![A-Z]{2})|[.&,]| (?>)&
Basically, it removes characters that fit 3 criteria:
Literal ".COM"
Spaces that are not preceded or followed by 2 capital letters
Dots, ampersands, and commas, regardless of where they appear
Spaces followed by ampersands
Demo: https://regex101.com/r/EMHxq9/2

regex capture text in brackets, omitting optional prefix

I'm trying to convert some documents (Wikipedia articles) which contain links with a specific markdown convention. I want to render these to be reader-friendly without links. The convention is:
Names in double-brackets with of the pattern [[Article Name|Display Name]] should be captured ignoring the pipe and preceding text as well as enclosing brackets:
Display Name.
Names in double-brackets of the pattern [[Article Name]] should be
captured without the brackets: Article Name.
Nested approach (produces desired result)
I know I can handle #1 and #2 in a nestedre.sub() expression. For example, this does what I want:
s = 'including the [[Royal Danish Academy of Sciences and Letters|Danish Academy of Sciences]], [[Norwegian Academy of Science and Letters|Norwegian Academy of Sciences]], [[Russian Academy of Sciences]], and [[National Academy of Sciences|US National Academy of Sciences]].'
re.sub('\[\[(.*?\|)(.*?)\]\]','\\2', # case 1
re.sub('\[\[([^|]+)\]\]','\\1',s) # case 2
)
# result is correct:
'including the Danish Academy of Sciences, Norwegian Academy of Sciences, Russian Academy of Sciences, and US National Academy of Sciences.'
Single-pass approach (looking for solution here)
For efficiency and my own improvement, I would like to know whether there is a single-pass approach.
What I have tried: In an optional group 1, I want to greedy-capture everything between [[ and a | (if it exists). Then in group 2, I want to capture everything else up to the ]]. Then I want to return only group 2.
My problem is in making the greedy capture optional:
re.sub('\[\[([^|]*\|)?(.*?)\]\]','\\2',s)
# does NOT return the desired result:
'including the Danish Academy of Sciences, Norwegian Academy of Sciences, US National Academy of Sciences.'
# is missing: 'Russian Academy of Sciences, and '
See regex in use here
\[{2}(?:(?:(?!]{2})[^|])+\|)*((?:(?!]{2})[^|])+)]{2}
\[{2} Match [[
(?:(?:(?!]{2})[^|])+\|)* Matches the following any number of times
(?:(?!]{2})[^|])+ Tempered greedy token matching any character one or more times except | or location that matches ]]
\| Matches | literally
((?:(?!]{2})[^|])+) Capture the following into capture group 1
(?:(?!]{2})[^|])+ Tempered greedy token matching any character one or more times except | or location that matches ]]
]{2} Match ]]
Replacement \1
Result:
including the Danish Academy of Sciences, Norwegian Academy of Sciences, Russian Academy of Sciences, and US National Academy of Sciences.
Another alternative that may work for you is the following. It's less specific than the regex above but doesn't include any lookarounds.
\[{2}(?:[^]|]+\|)*([^]|]+)]{2}

Splitting a string with no line breaks into a list of lines with a maximum column count

I have a long string (multiple paragraphs) which I need to split into a list of line strings. The determination of what makes a "line" is based on:
The number of characters in the line is less than or equal to X (where X is a fixed number of columns per line_)
OR, there is a newline in the original string (that will force a new "line" to be created.
I know I can do this algorithmically but I was wondering if python has something that can handle this case. It's essentially word-wrapping a string.
And, by the way, the output lines must be broken on word boundaries, not character boundaries.
Here's an example of input and output:
Input:
"Within eight hours of Wilson's outburst, his Democratic opponent, former-Marine Rob Miller, had received nearly 3,000 individual contributions raising approximately $100,000, the Democratic Congressional Campaign Committee said.
Wilson, a conservative Republican who promotes a strong national defense and reining in the size of government, won a special election to the House in 2001, succeeding the late Rep. Floyd Spence, R-S.C. Wilson had worked on Spence's staff on Capitol Hill and also had served as an intern for Sen. Strom Thurmond, R-S.C."
Output:
"Within eight hours of Wilson's outburst, his"
"Democratic opponent, former-Marine Rob Miller,"
" had received nearly 3,000 individual "
"contributions raising approximately $100,000,"
" the Democratic Congressional Campaign Committee"
" said."
""
"Wilson, a conservative Republican who promotes a "
"strong national defense and reining in the size "
"of government, won a special election to the House"
" in 2001, succeeding the late Rep. Floyd Spence, "
"R-S.C. Wilson had worked on Spence's staff on "
"Capitol Hill and also had served as an intern"
" for Sen. Strom Thurmond, R-S.C."
EDIT
What you are looking for is textwrap, but that's only part of the solution not the complete one. To take newline into account you need to do this:
from textwrap import wrap
'\n'.join(['\n'.join(wrap(block, width=50)) for block in text.splitlines()])
>>> print '\n'.join(['\n'.join(wrap(block, width=50)) for block in text.splitlines()])
Within eight hours of Wilson's outburst, his
Democratic opponent, former-Marine Rob Miller, had
received nearly 3,000 individual contributions
raising approximately $100,000, the Democratic
Congressional Campaign Committee said.
Wilson, a conservative Republican who promotes a
strong national defense and reining in the size of
government, won a special election to the House in
2001, succeeding the late Rep. Floyd Spence,
R-S.C. Wilson had worked on Spence's staff on
Capitol Hill and also had served as an intern for
Sen. Strom Thurmond
You probably want to use the textwrap function in the standard library:
http://docs.python.org/library/textwrap.html

Categories