I'm trying to convert some documents (Wikipedia articles) which contain links with a specific markdown convention. I want to render these to be reader-friendly without links. The convention is:
Names in double brackets of the pattern [[Article Name|Display Name]] should be captured ignoring the pipe and the preceding text, as well as the enclosing brackets:
Display Name.
Names in double-brackets of the pattern [[Article Name]] should be
captured without the brackets: Article Name.
Nested approach (produces desired result)
I know I can handle #1 and #2 in a nested re.sub() expression. For example, this does what I want:
s = 'including the [[Royal Danish Academy of Sciences and Letters|Danish Academy of Sciences]], [[Norwegian Academy of Science and Letters|Norwegian Academy of Sciences]], [[Russian Academy of Sciences]], and [[National Academy of Sciences|US National Academy of Sciences]].'
re.sub(r'\[\[(.*?\|)(.*?)\]\]', r'\2',  # case 1
       re.sub(r'\[\[([^|]+)\]\]', r'\1', s)  # case 2
)
# result is correct:
'including the Danish Academy of Sciences, Norwegian Academy of Sciences, Russian Academy of Sciences, and US National Academy of Sciences.'
Single-pass approach (looking for solution here)
For efficiency and my own improvement, I would like to know whether there is a single-pass approach.
What I have tried: In an optional group 1, I want to greedy-capture everything between [[ and a | (if it exists). Then in group 2, I want to capture everything else up to the ]]. Then I want to return only group 2.
My problem is in making the greedy capture optional:
re.sub(r'\[\[([^|]*\|)?(.*?)\]\]', r'\2', s)
# does NOT return the desired result:
'including the Danish Academy of Sciences, Norwegian Academy of Sciences, US National Academy of Sciences.'
# is missing: 'Russian Academy of Sciences, and '
\[{2}(?:(?:(?!]{2})[^|])+\|)*((?:(?!]{2})[^|])+)]{2}
\[{2} Match [[
(?:(?:(?!]{2})[^|])+\|)* Matches the following any number of times
(?:(?!]{2})[^|])+ Tempered greedy token matching any character one or more times except | or location that matches ]]
\| Matches | literally
((?:(?!]{2})[^|])+) Capture the following into capture group 1
(?:(?!]{2})[^|])+ Tempered greedy token matching any character one or more times except | or location that matches ]]
]{2} Match ]]
Replacement \1
Result:
including the Danish Academy of Sciences, Norwegian Academy of Sciences, Russian Academy of Sciences, and US National Academy of Sciences.
Another alternative that may work for you is the following. It's less specific than the regex above but doesn't include any lookarounds.
\[{2}(?:[^]|]+\|)*([^]|]+)]{2}
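For Python, the lookaround-free pattern above can be passed straight to re.sub(). The sketch below is only an illustration: the sample string is a shortened version of the one in the question, the names LINK, attempt and fixed are made up here, and the pattern is written with \] escapes, which should behave the same way. It also shows why the attempt in the question loses text: [^|]* is allowed to run across ]], so the optional group swallows a whole link plus the start of the next one.

```python
import re

s = ("including the [[Royal Danish Academy of Sciences and Letters|Danish Academy of Sciences]], "
     "[[Russian Academy of Sciences]], and "
     "[[National Academy of Sciences|US National Academy of Sciences]].")

# Attempt from the question: [^|]* can cross "]]", so group 1 may consume
# "Russian Academy of Sciences]], and [[National Academy of Sciences|".
attempt = re.sub(r"\[\[([^|]*\|)?(.*?)\]\]", r"\2", s)

# Excluding "]" as well as "|" keeps every piece inside one pair of brackets.
LINK = re.compile(r"\[\[(?:[^\]|]+\|)*([^\]|]+)\]\]")
fixed = LINK.sub(r"\1", s)

print(attempt)
print(fixed)
```

Because neither character class can match a closing bracket, each match stays within a single [[...]] link and one pass is enough.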
I have a PDF file and use PyPDF2 to extract its text. I'm looking for a regex that captures the numbered sentences in the file.
I tried a couple of regexes in the code below, but the output is not what I need. I need to capture each numbered point as one sentence, like this:
expected OUTPUT
['1. Please admit that Plaintiff, JOSHUA PINK, received benefits from a collateral
source, as defined by §768.76, Florida Statutes, for medical bills alleged to have been incurred as
a result of the incident described in the Complaint.',2. please.....]
Instead, the two regexes I tried either don't capture the full sentence or capture it across multiple lines, treating every \n as the start of a new sentence.
Extracted TEXT
" \n IN THE CIRCUIT COURT, OF THE \nEIGHTEENTH JUDICIAL CIRCUIT, IN \nAND FOR SEMINOLE COUNTY, \nFLORIDA \n \nCASE NO: 2022 -CA-002235 \n \nJOSHUA PINK, \n \n Plaintiff, \nvs. \n \nMATHEW ZUMBRUM , \n \n Defendant. \n / \n \nDEFENDANT'S REQUEST FOR ADMISSIONS TO PLAINTIFF, JOSHUA PINK \n \n \nCOME NOW the Defendant , MATHEW ZUMBRUM , by and through the undersigned \nattorneys, and pursuant to Rule 1.370, Florida Rul es of Civil Procedure, requests the Plaintiff, \nJOSHUA PINK, admit in this action that each of the following statements are true: \n1. Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral \nsource, as defined by §768.76, Florida Statute s, for medical bills alleged to have been incurred as \na result of the incident described in the Complaint. \n2. Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral \nsource, as defined by §768.76, Florida Statutes, for loss of wages o r income alleged to have been \nsustained as a result of the incident described in the Complaint. \n3. Please admit that Plaintiff, JOSHUA PINK , received benefits under the Personal \nInjury Protection portion of an automobile policy for medical bills alleged to have been incurred \nas a result of the incident described in the Complaint. \n Filing # 162442429 E-Filed 12/06/2022 09:46:49 AM\n \n2 4. Please admit that Plaintiff, JOSHUA PINK , received benefits under the Personal \nInjury Protection portion of an automobile insurance policy for loss of wages or income alleged \nto have been sustained as a result of the incident described in the Complaint. \n5. Please admit that Plaintiff, JOSHUA PINK , received benefits under the medical \npayments provisions of an automobile insurance policy for medical bills alleged to have been \nincurred as a result of the incident described in the Complaint. \n6. 
Please admit that Plaintiff, JOSHUA PINK , is subject to a deductible under the \nPersonal Injury Protection portion of an automobile insurance policy. \n7. Please admit that Plaintiff, JOSHUA PINK received benefits pursuant to personal \nor group health insurance policy, for medical bills alleged to have been incurred as a result of the \nincident described in the Complaint. \n8. Please admit that Plaintiff, JOSHUA PINK , received benefits pursuant to a \npersonal or group wage continuation plan or policy, for loss of wages or income alleged to have \nbeen sustained as a result of the incident described in the Complaint. \n 9. Please admit that on the date of the accident alleged in your Complaint, Defendant, \nMATHEW ZUMBRUM , complied with and met the security requirements under Chapter \n627.730 - 627.7405, Florida Statutes. \n10. Please admit that Plaintiff, JOSHUA PINK , was partially responsible for the \nsubject accident. \n11. Please admit that Plaintiff, JOSHUA PINK , did NOT suffer a permanent injury as \na result of the subject accident. \nI HEREBY CERTIFY that on the 6th day of December, 2022 a true and correct copy of \nthe foregoing was electronically filed with the Florida Court s E-Filing Portal system which will \n \n3 send a notice of electronic filing to Michael R. Vaughn, Esq., Morgan & Morgan, P.A., 20 N. \nOrange Ave, 16th Floor, Orlando, FL 32801 at mvaughn#forthepeople.com; \njburnham#forthepeople.com; mserrano#forthepeople.com. \nAND REW J. GORMAN & ASSOCIATES \n \nBY: \n \n(Original signed electronically by Attorney.) \nLOURDES CALVO -PAQUETTE, ESQ. \nAttorney for Defendant, Zumbrum \n390 N. Orange Avenue, Suite 1700 \nOrlando, FL 32801 \nTelephone: (407) 872 -2498 \nFacsímile: (855) 369 -8989 \nFlorida Bar No. 0817295 \nE-mail for service (FL R. Jud. Admin. 2.516) : \nflor.law -mlslaw.172o19#statefarm.com \n \nAttorneys and Staff of Andrew J. 
Gorman & \nAssociates are Employees of the Law Department \nof State Farm Mutual Automobile Insurance \nCompany. \n \n \n\n"
sample output of regex2 (sentence is captured in 2 lines)
[('2022', 'CA-002235 '),
('1', 'Florida Rul es of Civil Procedure, requests the Plaintiff,'),
('1',
'Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral'),
('768',
'Florida Statute s, for medical bills alleged to have been incurred as'),...]
sample output of regex1 (not capturing full sentence)
['1. Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral ',
'2. Please admit that Plaintiff, JOSHUA PINK , received benefits from a collateral ',
'3. Please admit that Plaintiff, JOSHUA PINK , received benefits under the Personal ',
'2 4. Please admit that Plaintiff, JOSHUA PINK , received benefits under the Personal ',
'5. Please admit that Plaintiff, JOSHUA PINK , received benefits under the medical ',....]
code:
import re
from PyPDF2 import PdfReader

def read_pdf(name):
    # PdfReader takes a path or stream; it has no file-mode argument
    reader = PdfReader(name)
    text = ""
    for page in reader.pages:
        text += page.extract_text() + "\n"
    # regex1 = r'(^[0-9].*)'
    regex2 = r'([\d]+).+?([a-zA-Z].+).'
    pat = re.compile(regex2, re.M)  # was compile(regex, ...), an undefined name
    extracted_text = pat.findall(text)
    return text, extracted_text

text, pdf1 = read_pdf(names[0])
I'll provide an answer to go over a couple of different patterns you can use to approach text items like that. Let's say you have a text that is structured like this:
test_str = """
Some preamble.
 1. Very
long
sentence.
 2. One-line sentence.
 3. Another
longer sentence.
A new paragraph.
"""
First scenario: you want to match items that begin with a number followed by a period at the beginning of a line (with optional leading space) and end with a period at the end of a line - irrespective of how many characters it takes, but as few as possible. That's what your question reads like. One pattern that describes this is ^[ \t]*\d+\.[\s\S]*?\.$. The heavy lifting here is done by [\s\S]*? which is a lazy class that just matches any character (by including all spaces and all non-spaces) as few times as possible.
regex1 = re.compile(r"^[ \t]*\d+\.[\s\S]*?\.$", re.MULTILINE)
print(re.findall(regex1, test_str))
Which returns:
[' 1. Very\nlong\nsentence.', ' 2. One-line sentence.', ' 3. Another\nlonger sentence.']
If you want to exclude leading space, you could add a capturing group ^[ \t]*(\d+\.[\s\S]*?\.)$ in which case findall() will only return the captured part. In Python:
regex2 = re.compile(r"^[ \t]*(\d+\.[\s\S]*?\.)$", re.MULTILINE)
print(re.findall(regex2, test_str))
Which returns:
['1. Very\nlong\nsentence.', '2. One-line sentence.', '3. Another\nlonger sentence.']
First scenario, alternative expression: after the leading number, express the match in terms of lines; always get the first line and add every following line as long as the preceding line does not end in a period: ^[ \t]*(\d+\..*(?:[^.]$\r?\n.*)*\.)$. This will be faster than the lazy class in the first example and returns the same as with regex2.
regex3 = re.compile(r"^[ \t]*(\d+\..*(?:[^.]$\r?\n.*)*\.)$", re.MULTILINE)
print(re.findall(regex3, test_str))
Second scenario: we don't care what the sentence(s) end in. Just get complete items, which we'll interpret as the leading number followed by all lines that do not start with another leading number or an entirely new paragraph: ^[ \t]*(\d+\..+$(?:\r?\n(?![ \t]*\d+\.|A new).*)*).
This makes use of a negative lookahead (?![ \t]*\d+\.|A new) to prevent matching lines that start either with a new item number or some non-item text and allows more control over what kind of lines may constitute an item. Return values are the same.
regex4 = re.compile(r"^[ \t]*(\d+\..+$(?:\r?\n(?![ \t]*\d+\.|A new).*)*)", re.MULTILINE)
print(re.findall(regex4, test_str))
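As a rough sketch of how the second scenario might be applied to text like the PDF extract in the question: the sample below is a shortened, made-up stand-in for the real extract, and the stop phrase "I HEREBY" in the lookahead is an assumption about what follows the last item in that particular document. Whitespace inside each item is collapsed afterwards so every item ends up on one line.

```python
import re

# Shortened stand-in for the extracted PDF text (lines end in " \n" as in the question).
text = (
    "each of the following statements are true: \n"
    "1. Please admit that Plaintiff received benefits from a collateral \n"
    "source, as defined by statute. \n"
    "2. Please admit that Plaintiff was partially responsible for the \n"
    "subject accident. \n"
    "I HEREBY CERTIFY that a copy was filed. \n"
)

# An item is a numbered line plus every following line that neither starts
# a new item nor starts with the (assumed) stop phrase "I HEREBY".
item = re.compile(
    r"^[ \t]*(\d+\..+$(?:\r?\n(?![ \t]*\d+\.|I HEREBY).*)*)",
    re.MULTILINE,
)

# Collapse line breaks and repeated spaces inside each captured item.
items = [" ".join(m.split()) for m in item.findall(text)]
for entry in items:
    print(entry)
```

Page-break residue such as the "2 4." in the real extract would still need separate handling, since that line does not start with the item number.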
If you want to match sentences followed by a dot, you might use:
\b\d+\.[^\S\n][^.]*(?:\.(?=\S)[^.]*)*\.
Explanation
\b A word boundary to prevent a partial word match
\d+\.[^\S\n] Match 1+ digits, a dot and a space
[^.]*(?:\.(?=\S)[^.]*)* Optionally match any character except for dots, and then only match the dot when there is a non whitespace character following.
\. Match a dot
A pattern with more punctuation characters:
\b\d+\.[^\S\n][^.!?]*(?:[.!?](?=\S)[^.!?]*)*[.!?]
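A small Python sketch of the first pattern in this answer, on a made-up sample that mimics the question's text (the sample and the names SENTENCE and sentences are illustrative only). It shows the lookahead at work: the dot in 768.76 is followed by a non-whitespace character, so it does not end the sentence, while a dot followed by whitespace does.

```python
import re

SENTENCE = re.compile(r"\b\d+\.[^\S\n][^.]*(?:\.(?=\S)[^.]*)*\.")

text = (
    "the following statements are true: \n"
    "1. Please admit receipt of benefits as defined by §768.76, Florida \n"
    "Statutes. \n"
    "2. Please admit partial responsibility. \n"
)

# [^.]* also matches newlines, so a sentence may span lines;
# join the pieces with single spaces afterwards.
sentences = [" ".join(m.split()) for m in SENTENCE.findall(text)]
for sentence in sentences:
    print(sentence)
```

One limitation worth keeping in mind: a period followed by whitespace inside an item, for example after an abbreviation, would end the match early.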
Try this:
(\d+\.\s)(.|\n)*?(?=\d+\.\s|\z|\.\s)
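This will match from any number followed by a period and a space to the end of the sentence (period followed by a space) or until the next number followed by a period and a space or the end of the string. Note that \z is PCRE syntax; Python's re module spells the end-of-string anchor \Z, and older Python versions reject \z as a bad escape.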
I recommend using the Punkt sentence tokenizer (NLTK) or another NLP package of your choice, since writing a general-purpose regex to detect sentences can be very tricky unless the pattern is strictly defined and limited in scope. For example, if you only take numbered sentences, the following regex might work: "\d\.(.)+[a-z]\."gmi
I'm working on parsing string text containing information on university, year, degree field, and whether or not a person graduated. Here are two examples:
ex1 = 'BYU: 1990 Bachelor of Arts Theater (Graduated):BYU: 1990 Bachelor of Science Mathematics (Graduated):UNIVERSITY OF VIRGINIA: 1995 Master of Science Mechanical Engineering (Graduated):MICHIGAN STATE UNIVERSITY: 2008 Master of Fine Arts INDUSTRIAL DESIGN (Graduated)'
ex2 = 'UCSD: 2001 Bachelor of Arts English:UCLA: 2005 Bachelor of Science Economics (Graduated):UCSD 2010 Master of Science Economics'
What I am struggling to accomplish is to have an entry for each school experience regardless of whether specific information is missing. In particular, imagine I wanted to pull whether each degree was finished from ex1 and ex2 above. When I try to use re.findall I end up with something like the following for ex1:
# Code:
re.findall('[A-Z ]+: \d+ Bachelor [^:]+\((Graduated)', ex1)
# Output:
['Graduated', 'Graduated']
which is what I want, two entries for two Bachelor's degrees. For ex2, however, one of the Bachelor's degrees was unfinished so the text does not contain "(Graduated)", so the output is the following:
# Code:
re.findall('[A-Z ]+: \d+ Bachelor [^:]+\((Graduated)', ex2)
# Output:
['Graduated']
# Desired Output:
['', 'Graduated']
I have tried making the capture group optional or including the colon after graduated and am not making much headway. The example I am using is the "Graduated" information, but in principle the more general question remains if there is an identifiable degree but it is missing one or two pieces of information (like graduation year or university). Ultimately I am just looking to have complete information on each degree, including whether certain pieces of information are missing. Thank you for any help you can provide!
You can use the ? quantifier to make "Graduated" (and the opening parenthesis before it) optional, i.e. matched zero or one time.
re.findall('[A-Z ]+: \d+ Bachelor [^:()]*\(?(Graduated)?', ex2)
Output:
>>> re.findall('[A-Z ]+: \d+ Bachelor [^:()]*\(?(Graduated)?', ex2)
['', 'Graduated']
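The same optional-group idea can be stretched to the more general question of keeping one record per degree. The following is only a sketch under several assumptions that are not stated in the question: the year is always four digits and always present, degrees start with "Bachelor" or "Master", and only the colon after the school and the "(Graduated)" marker may be missing. The names ENTRY and rows are hypothetical.

```python
import re

ENTRY = re.compile(
    r"(?P<school>[A-Z ]+):? "                      # school, colon optional
    r"(?P<year>\d{4}) "                            # four-digit year (assumed present)
    r"(?P<degree>(?:Bachelor|Master) of [A-Za-z ]+?)"
    r"(?: \((?P<status>Graduated)\))?"             # optional "(Graduated)"
    r"(?=:|$)"                                     # entry ends at ":" or end of string
)

ex2 = ('UCSD: 2001 Bachelor of Arts English:UCLA: 2005 Bachelor of Science '
       'Economics (Graduated):UCSD 2010 Master of Science Economics')

# groupdict(default="") turns an unmatched optional group into "".
rows = [m.groupdict(default="") for m in ENTRY.finditer(ex2)]
for row in rows:
    print(row)
```

Each entry yields a dictionary, and a missing "(Graduated)" shows up as an empty string rather than dropping the entry.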
How about this?
[re.sub('[(:)]', '', t) for t in [re.sub('^[^\(]+','', s) for s in re.findall('[A-Z ]+: \d+ Bachelor [^:]+:', ex1)]]
# output ['Graduated', 'Graduated']
[re.sub('[(:)]', '', t) for t in [re.sub('^[^\(]+','', s) for s in re.findall('[A-Z ]+: \d+ Bachelor [^:]+:', ex2)]]
# output ['', 'Graduated']
Consider the original strings shown in the first column of the following table:
Original String                   Parsed String               Desired String
'W. & J. JOHNSON LMT.COM'         #W J JOHNSON LIMITED        #WJ JOHNSON LIMITED
'NORTH ROOF & WORKS CO. LTD.'     #NORTH ROOF WORKS CO LTD    #NORTH ROOF WORKS CO LTD
'DAVID DOE & CO., LIMITED'        #DAVID DOE CO LIMITED       #DAVID DOE CO LIMITED
'GEORGE TV & APPLIANCE LTD.'      #GEORGE TV APPLIANCE LTD    #GEORGE TV APPLIANCE LTD
'LOVE BROS. & OTHERS LTD.'        #LOVE BROS OTHERS LTD       #LOVE BROS OTHERS LTD
'A. B. & MICHAEL CLEAN CO. LTD.'  #A B MICHAEL CLEAN CO LTD   #AB MICHAEL CLEAN CO LTD
'C.M. & B.B. CLEANER INC.'        #C M B B CLEANER INC        #CMBB CLEANER INC
Punctuation needs to be removed, which I have done as follows:
def transform(word):
    word = re.sub(r'(?<=[A-Za-z])\'(?=[A-Za-z])[A-Z]|[^\w\s]|(.com|COM)',' ',word)
    return word
However, there is one last point I have not been able to solve. After removing the punctuation I end up with lots of spaces. How can I write a regular expression that joins initials together but keeps single spaces between regular words (non-initials)?
Is this a bad approach to substitute the mentioned characters to get the desired strings?
Thanks for allowing me to continue learning :)
I think it's simpler to do this in parts. First, remove .com and any punctuation other than space or &. Then, remove a space or & surrounded by only one letter. Finally, replace any remaining sequence of space or & with a single space:
import re
strings = ['W. & J. JOHNSON LMT.COM',
'NORTH ROOF & WORKS CO. LTD.',
'DAVID DOE & CO., LIMITED',
'GEORGE TV & APPLIANCE LTD.',
'LOVE BROS. & OTHERS LTD.',
'A. B. & MICHAEL CLEAN CO. LTD.',
'C.M. & B.B. CLEANER INC.'
]
for s in strings:
    s = re.sub(r'\.COM|[^a-zA-Z& ]+', '', s, flags=re.IGNORECASE)
    s = re.sub(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', s)
    s = re.sub(r'\s*[& ]\s*', ' ', s)
    print(s)
Output
WJ JOHNSON LMT
NORTH ROOF WORKS CO LTD
DAVID DOE CO LIMITED
GEORGE TV APPLIANCE LTD
LOVE BROS OTHERS LTD
AB MICHAEL CLEAN CO LTD
CM BB CLEANER INC
Update
This was written before the edit to the question changing the required result for the last data. Given the edit, the above code can be simplified to
for s in strings:
    s = re.sub(r'\.COM|[^a-zA-Z ]+|\s(?=&)|(?<!\w\w)\s+(?!\w\w)', '', s, flags=re.IGNORECASE)
    print(s)
Doing this in regex alone won't be pretty and is not the best solution, yet, here it is! You're better off doing a multiple step approach. What I've done is identified all the cases that are possible and opted to find a solution where there's no replacement string since you're not always replacing the characters with spaces.
Rules
Non "Stacked" Abbreviations
These are locations like A. B. or W. & J., but not C.M. & B.B.
I've identified these as locations where an abbreviation part (e.g. A.) exists before and after, but the latter is not followed by another alpha character
Preceding Space
These locations don't exist in your text but could if a space preceded a non-alpha character without a space following it (say at the end of a line)
We match the characters after the first space in these cases
Following Space
These are locations like & and the dot in J.
We match the character before the last space in those examples
No Spaces
These are locations like 'LOVE (the apostrophe in that string)
We only match the non-alpha-non-whitespace characters
Regex
An all-in-one regex that accomplishes this is as follows:
(?<=\b[a-z])[^a-z]+(?=[a-z]\b(?![^a-z][a-z]))|(?<= ) *(?:\.com\b|[^a-z\s]+) *| *(?:\.com\b|[^a-z\s]+) *(?= )|(?<! )(?:\.com\b|[^a-z\s]+)(?! )
Works as follows (broken into each alternation):
(?<=\b[a-z])[^a-z]+(?=[a-z]\b(?![^a-z][a-z])) matches non-alpha characters between A. and B. but not A. and B.B
(?<=\b[a-z]) positive lookbehind ensuring what precedes is an alpha character and assert a word boundary position to its left
[^a-z]+ match any non-alpha character one or more times
(?=[a-z]\b(?![^a-z][a-z])) positive lookahead ensuring the following exists
[a-z]\b match any alpha character and assert a word boundary position to its right
(?![^a-z][a-z]) negative lookahead ensuring what follows is not a non-alpha character followed by an alpha character
(?<= ) *(?:\.com\b|[^a-z\s]+) * ensures a space precedes, then matches any spaces, .com or any non-word-non-whitespace characters one or more times, then any spaces
(?<= ) positive lookbehind ensuring a space precedes
* match any number of spaces
(?:\.com\b|[^a-z\s]+) match .com and ensure a non-word character follows, or match any non-word-non-whitespace character one or more times
* match any number of spaces
*(?:\.com\b|[^a-z\s]+) *(?= ) matches any spaces, .com or any non-word-non-whitespace characters one or more times, then any spaces, then ensures a space follows
Same as previous but instead of the positive lookbehind at the beginning, there's a positive lookahead at the end
(?<! )(?:\.com\b|[^a-z\s]+)(?! ) matches .com or any non-alpha-non-whitespace characters one or more times ensuring no spaces surround it
Same as previous two options but uses negative lookbehind and negative lookahead
Code
import re
strings = [
"'W. & J. JOHNSON LMT.COM'",
"'NORTH ROOF & WORKS CO. LTD.'",
"'DAVID DOE & CO., LIMITED'",
"'GEORGE TV & APPLIANCE LTD.'",
"'LOVE BROS. & OTHERS LTD.'",
"'A. B. & MICHAEL CLEAN CO. LTD.'",
"'C.M. & B.B. CLEANER INC.'"
]
r = re.compile(r'(?<=\b[a-z])[^a-z]+(?=[a-z]\b(?![^a-z][a-z]))|(?<= ) *(?:\.com\b|[^a-z\s]+) *| *(?:\.com\b|[^a-z\s]+) *(?= )|(?<! )(?:\.com\b|[^a-z\s]+)(?! )', re.IGNORECASE)
def transform(word):
    return re.sub(r, '', word)

for s in strings:
    print(transform(s))
Outputs:
WJ JOHNSON LMT
NORTH ROOF WORKS CO LTD
DAVID DOE CO LIMITED
GEORGE TV APPLIANCE LTD
LOVE BROS OTHERS LTD
AB MICHAEL CLEAN CO LTD
CM BB CLEANER INC
Edit
Using a callback, you can extend this logic to include special cases as mentioned in a comment below my answer to match specific cases and have conditional replacements.
These special cases include:
FONTAINE'S to FONTAINE
PREMIUM-FIT AUTO to PREMIUM FIT AUTO
62325 W.C. to 62325 WC
I added a new alternation to the regex: (\b[\'-]\b(?:[a-z\d] )?) to capture 'S or - between letters (also -S or similar) and replace it with a space using the callback (if the capture group exists).
I still suggest using multiple regular expressions to accomplish this, but I wanted to show that it is possible with a single pattern.
import re
strings = [
"'W. & J. JOHNSON LMT.COM'",
"'NORTH ROOF & WORKS CO. LTD.'",
"'DAVID DOE & CO., LIMITED'",
"'GEORGE TV & APPLIANCE LTD.'",
"'LOVE BROS. & OTHERS LTD.'",
"'A. B. & MICHAEL CLEAN CO. LTD.'",
"'C.M. & B.B. CLEANER INC.'",
"'FONTAINE'S PREMIUM-FIT AUTO 62325 W.C.'"
]
r = re.compile(r'(?<=\b[a-z\d])[^a-z\d]+(?=[a-z\d]\b(?![^a-z\d][a-z\d]))|(?<= ) *(?:\.com\b|[^a-z\d\s]+) *| *(?:\.com\b|[^a-z\d\s]+) *(?= )|(\b[\'-]\b(?:[a-z\d] )?)|(?<! )(?:\.com\b|[^a-z\d\s]+)(?! )', re.IGNORECASE)
def repl(m):
    return ' ' if m.group(1) else ''

for s in strings:
    print(r.sub(repl, s))
Here's the simplest I could get it with one regex pattern:
\.COM|(?<![A-Z]{2}) (?![A-Z]{2})|[.&,]| (?>)&
Basically, it removes characters that fit any of four criteria:
Literal ".COM"
Spaces that are not preceded or followed by 2 capital letters
Dots, ampersands, and commas, regardless of where they appear
Spaces followed by ampersands
Demo: https://regex101.com/r/EMHxq9/2
Basically, I want to remove the certain phrase patterns embedded in my text data:
Starts with an upper case letter and ends with an Em Dash "—"
Starts with an Em Dash "—" and ends with a "Read Next"
Say, I've got the following data:
CEBU CITY—The widow of slain human rights lawyer .... citing figures from the NUPL that showed that 34 lawyers had been killed in the past two years. —WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next
and
Manila, Philippines—President .... but justice will eventually push its way through their walls of impunity, ... —REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next
I want to remove the following phrases:
"CEBU CITY—"
"—WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next"
"Manila, Philippines—"
"—REPORTS FROM MELVIN GASCON, JULIE M. AURELIO, DELFIN T. MALLARI JR., JEROME ANING, JOVIC YEE, GABRIEL PABICO LALU, PATHRICIA ANN V. ROXAS, DJ YAP, AFP, APRead Next"
I assume this needs two regexes, one for each of the patterns listed above.
The regex: —[A-Z].*Read Next\s*$ may work on the pattern # 2 but only when there are no other em dashes in the text data. It will not work when pattern # 1 occurs as it will remove the chunk from the first em dash it has seen until the "Read Next" string.
I have tried the following regex for pattern # 1:
^[A-Z]([A-Za-z]).+(—)$
Why does it not work? That regex was supposed to look for a phrase that starts with any upper case letter, followed by a string of any length, as long as it ends with an "—".
What you are treating as a hyphen (-) is not a hyphen but an em dash, so you need a regex that starts with an em dash instead of a hyphen:
^—[A-Z].*Read Next\s*$
Here is the explanation for this regex,
^ --> Start of input
— --> Matches a literal Em Dash whose Unicode Decimal Code is 8212
[A-Z] --> Matches an upper case letter
.* --> Matches any character zero or more times
Read Next --> Matches these literal words
\s* --> This is for matching any optional white space that might be present at the end of line
$ --> End of input
The regex that should take care of this -
^—[A-Z]+(.)*(Read Next)$
You can try implementing this regex on your data and see if it works out.
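For the two patterns in the question taken together, one possible sketch in Python is shown below. It is an illustration rather than a tested solution for the full data set: the sample text is abbreviated, the names DATELINE, CREDITS and strip_wrappers are made up, and it assumes every article really does open with a dateline, since otherwise the first pattern would cut text up to the first em dash. Neither pattern requires the em dash to sit at the very start or end of the string, which is where the question's ^[A-Z]([A-Za-z]).+(—)$ goes wrong: its $ demands that the em dash be the last character.

```python
import re

# Pattern 1: leading dateline, from the first capital letter to the first em dash.
DATELINE = re.compile(r"^[A-Z][^—]*—")
# Pattern 2: from the last em dash through "Read Next" at the end of the text.
CREDITS = re.compile(r"\s*—[^—]*Read Next\s*$")

def strip_wrappers(text):
    """Remove the dateline and the trailing credits line, if present."""
    text = DATELINE.sub("", text, count=1)
    return CREDITS.sub("", text)

sample = ("CEBU CITY—The widow of slain human rights lawyer spoke. "
          "—WITH REPORTS FROM JULIE M. AURELIO AND DJ YAPRead Next")
cleaned = strip_wrappers(sample)
print(cleaned)
```

Using [^—]* instead of .* is what keeps each pattern from running past another em dash, which was the problem described for the combined case.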
I have a string as follows:
27223525
West Food Group B.V.9
52608670
Westcon
Group European Operations Netherlands Branch
30221053
Westland Infra Netbeheer B.V.
27176688
Wetransfer 85 B.V.
34380998
WETRAVEL B.V.
70669783
This string contains many newline characters. I want to explicitly ignore these, as well as all numbers with six or more digits. I came up with the following regex:
[^\n\d{6,}].+
This almost takes me there as it returns all the company names, however in cases where the company name itself contains a new line character these get returned as two different company names. For instance Westcon is a match and Group European Operations Netherlands Branch is also a match. I would like to tweak the above expression to make sure that the final match is Westcon European Operations Netherlands Branch. What regex concepts should I use to achieve this?
Edit
I tried the following based on the comment below but got the wrong result
text = 'West Food Group B.V.9\n \n52608670\n \nWestcon\n \nGroup European Operations Netherlands Branch\n \n30221053\n \nWestland Infra Netbeheer B.V.\n \n27176688\n \nWetransfer 85 B.V.\n \n34380998\n \nWETRAVEL B.V.\n \n70669783\n \nWeWork Companies (International) B.V.\n \n61501220\n \nWeWork Netherlands B.V.\n \n61505439\n \nWexford Finance B.V.\n \n27124941\n \nWFC\n-\nFood Safety B.V.\n \n11069471\n \nWhale Cloud Technology Netherlands B.V.\n \n63774801\n \nWHILL Europe B.V.\n \n72465700\n \nWhirlpool Nederland B.V.\n \n20042061\n \nWhitaker\n-\nTaylor Netherlands B.V.\n \n66255163\n \nWhite Oak B.V.\n'
re.findall(r'[^\n\d{6,}](?:(?:[a-z\s.]+(\n[a-z\s.])*)|.+)',text)
I think that you only want the company names. If so, this should work.
input = '''27223525
West Food Group B.V.9
52608670
Westcon
Group European Operations Netherlands Branch
30221053
Westland Infra Netbeheer B.V.
27176688
Wetransfer 85 B.V.
34380998
WETRAVEL B.V.
70669783
'''
company_name_regex = re.findall(r'[A-Za-z].*|[A-Za-z].*\d{1,5}.*', input)
pprint(company_name_regex)
['West Food Group B.V.9',
'Westcon',
'Group European Operations Netherlands Branch',
'Westland Infra Netbeheer B.V.',
'Wetransfer 85 B.V.',
'WETRAVEL B.V.']
This will create one group for lines that don't have numbers.
regex: /(?!(\d{6,}|\n))[a-zA-Z .\n]+/g
Demo: https://regex101.com/r/MMLGw6/1
Assuming your company names starts with a letter, you may use this regex with re.M modifier:
^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)
In python:
regex = re.compile(r"^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)", re.M)
This matches a line that starts with [a-zA-Z] until end of line and then matches more lines separated by \n that also start with [a-zA-Z] characters.
(?=\n+\d{6,}$) is a lookahead assertion to make sure our company names have a newline and 6+ digits ahead.
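A short sketch of how this pattern might be used end to end, including the whitespace-padded line breaks from the question's edit. The normalisation step and the names raw, cleaned and names are additions for illustration and not part of the answer above; the sample is a shortened excerpt of the question's text.

```python
import re

raw = ("West Food Group B.V.9\n \n52608670\n \nWestcon\n \n"
       "Group European Operations Netherlands Branch\n \n30221053\n \n"
       "WETRAVEL B.V.\n \n70669783\n")

# Collapse "\n \n" style separators into single newlines first (added assumption).
cleaned = re.sub(r"\s*\n\s*", "\n", raw)

regex = re.compile(r"^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)", re.M)

# A name that spans several lines is matched as one string; join its lines with spaces.
names = [" ".join(m.split()) for m in regex.findall(cleaned)]
for name in names:
    print(name)
```

Continuation lines that start with a non-letter, such as the lone "-" in "WFC\n-\nFood Safety B.V." from the question's longer sample, would need the pattern to be widened.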
If you can solve this without regex it should be solved without regex:
useful = []
for line in text.splitlines():  # split() would break on every space, not just newlines
    line = line.strip()
    if line and not line.isdigit():
        useful.append(line)
That should work - more or less. Replying from my phone so can't test.