Using regex to extract based on a recurring pattern excluding newline characters

I have a string as follows:
27223525
West Food Group B.V.9
52608670
Westcon
Group European Operations Netherlands Branch
30221053
Westland Infra Netbeheer B.V.
27176688
Wetransfer 85 B.V.
34380998
WETRAVEL B.V.
70669783
This string contains many newline characters. I want to explicitly ignore these, as well as all numbers with 6 or more digits. I came up with the following regex:
[^\n\d{6,}].+
This almost takes me there as it returns all the company names, however in cases where the company name itself contains a new line character these get returned as two different company names. For instance Westcon is a match and Group European Operations Netherlands Branch is also a match. I would like to tweak the above expression to make sure that the final match is Westcon European Operations Netherlands Branch. What regex concepts should I use to achieve this?
Edit
I tried the following based on the comment below but got the wrong result
text = 'West Food Group B.V.9\n \n52608670\n \nWestcon\n \nGroup European Operations Netherlands Branch\n \n30221053\n \nWestland Infra Netbeheer B.V.\n \n27176688\n \nWetransfer 85 B.V.\n \n34380998\n \nWETRAVEL B.V.\n \n70669783\n \nWeWork Companies (International) B.V.\n \n61501220\n \nWeWork Netherlands B.V.\n \n61505439\n \nWexford Finance B.V.\n \n27124941\n \nWFC\n-\nFood Safety B.V.\n \n11069471\n \nWhale Cloud Technology Netherlands B.V.\n \n63774801\n \nWHILL Europe B.V.\n \n72465700\n \nWhirlpool Nederland B.V.\n \n20042061\n \nWhitaker\n-\nTaylor Netherlands B.V.\n \n66255163\n \nWhite Oak B.V.\n'
re.findall(r'[^\n\d{6,}](?:(?:[a-z\s.]+(\n[a-z\s.])*)|.+)',text)

I think that you only want the company names. If so, this should work.
import re
from pprint import pprint

input = '''27223525
West Food Group B.V.9
52608670
Westcon
Group European Operations Netherlands Branch
30221053
Westland Infra Netbeheer B.V.
27176688
Wetransfer 85 B.V.
34380998
WETRAVEL B.V.
70669783
'''
company_name_regex = re.findall(r'[A-Za-z].*|[A-Za-z].*\d{1,5}.*', input)
pprint(company_name_regex)
['West Food Group B.V.9',
'Westcon',
'Group European Operations Netherlands Branch',
'Westland Infra Netbeheer B.V.',
'Wetransfer 85 B.V.',
'WETRAVEL B.V.']

This will create one group for lines that don't have numbers.
regex: /(?!(\d{6,}|\n))[a-zA-Z .\n]+/g
Demo: https://regex101.com/r/MMLGw6/1

Assuming your company names start with a letter, you may use this regex with the re.M modifier:
^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)
In python:
regex = re.compile(r"^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)", re.M)
This matches a line that starts with [a-zA-Z] until end of line and then matches more lines separated by \n that also start with [a-zA-Z] characters.
(?=\n+\d{6,}$) is a lookahead assertion to make sure our company names have a newline and 6+ digits ahead.
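A minimal sketch of how this pattern could be applied to the sample from the question, assuming each company name is followed by a line holding an ID of six or more digits. The variable names and the final step that joins wrapped names with a space are illustrative assumptions, not part of the answer itself.

```python
import re

text = """27223525
West Food Group B.V.9
52608670
Westcon
Group European Operations Netherlands Branch
30221053
Westland Infra Netbeheer B.V.
27176688
Wetransfer 85 B.V.
34380998
WETRAVEL B.V.
70669783
"""

regex = re.compile(r"^[a-zA-Z].*(?:\n+[a-zA-Z].*)*(?=\n+\d{6,}$)", re.M)

# A name that wraps over several lines is matched as one string containing
# newlines, so collapse every run of whitespace into a single space.
names = [" ".join(match.split()) for match in regex.findall(text)]
print(names)
```

Because the pattern has no capturing groups, findall returns the full matches, so the wrapped name comes back as a single entry.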

If you can solve this without regex it should be solved without regex:
useful = []
# splitlines() keeps each line whole; split() would break names into single words
for line in text.splitlines():
    if line.strip() and not line.strip().isdigit():
        useful.append(line)
That should work - more or less. Replying from my phone so can't test.
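The loop idea can also be extended to merge names that wrap over several lines, which is what the question asks for. This is only a sketch under the assumption that every all-digit line is an ID separating two names; the function name company_names is hypothetical.

```python
def company_names(text):
    """Join consecutive non-numeric lines into one name per ID block."""
    names, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.isdigit():
            # An ID line closes the name collected so far (assumption).
            if current:
                names.append(" ".join(current))
                current = []
        else:
            current.append(line)
    if current:
        names.append(" ".join(current))
    return names


sample = (
    "27223525\nWest Food Group B.V.9\n52608670\nWestcon\n"
    "Group European Operations Netherlands Branch\n30221053\n"
    "WETRAVEL B.V.\n70669783\n"
)
print(company_names(sample))
```

Note that this treats any all-digit line as an ID regardless of its length, which is looser than the 6-or-more-digits rule in the question.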

Related

Tokenize paragraphs by special characters; then rejoin so tokenized segments to reach certain length

I have this long paragraph:
paragraph = "The weakening of the papacy by the Avignon exile and the Papal Schism; the breakdown of monastic discipline and clerical celibacy; the luxury of prelates, the corruption of the Curia, the worldly activities of the popes; the morals of Alexander VI, the wars of Julius II, the careless gaiety of Leo X; the relicmongering and peddling of indulgences; the triumph of Islam over Christendom in the Crusades and the Turkish wars; the spreading acquaintance with non-Christian faiths; the influx of Arabic science and philosophy; the collapse of Scholasticism in the irrationalism of Scotus and the skepticism of Ockham; the failure of the conciliar movement to effect reform; the discovery of pagan antiquity and of America; the invention of printing; the extension of literacy and education; the translation and reading of the Bible; the newly realized contrast between the poverty and simplicity of the Apostles and the ceremonious opulence of the Church; the rising wealth and economic independence of Germany and England; the growth of a middle class resentful of ecclesiastical restrictions and claims; the protests against the flow of money to Rome; the secularization of law and government; the intensification of nationalism and the strengthening of monarchies; the nationalistic influence of vernacular languages and literatures; the fermenting legacies of the Waldenses, Wyclif, and Huss; the mystic demand for a less ritualistic, more personal and inward and direct religion: all these were now uniting in a torrent of forces that would crack the crust of medieval custom, loosen all standards and bonds, shatter Europe into nations and sects, sweep away more and more of the supports and comforts of traditional beliefs, and perhaps mark the beginning of the end for the dominance of Christianity in the mental life of European man."
My goal is to split this long paragraph into multiple sentences keeping the sentences around 18 - 30 words each.
There is only one full-stop at the end; so nltk tokenizer is of no use. I can use regex to tokenize; I have this pattern that works in splitting:
regex_special_chars = '([″;*"(§=!‡…†\\?\\]‘)¿♥[]+)'
new_text = re.split(regex_special_chars, paragraph)
The question is how to rejoin the split pieces into a list of segments of around 18 to 30 words each, where possible; sometimes it is not possible with this regex.
The end result will look like the following list below:
tokenized_paragraph = ['The weakening of the papacy by the Avignon exile and the Papal Schism; the breakdown of monastic discipline and clerical celibacy;',
'the luxury of prelates, the corruption of the Curia, the worldly activities of the popes; the morals of Alexander VI, the wars of Julius II, the careless gaiety of Leo X;',
'the relicmongering and peddling of indulgences; the triumph of Islam over Christendom in the Crusades and the Turkish wars; the spreading acquaintance with non-Christian faiths; ',
'the influx of Arabic science and philosophy; the collapse of Scholasticism in the irrationalism of Scotus and the skepticism of Ockham; the failure of the conciliar movement to effect reform; ',
'the discovery of pagan antiquity and of America; the invention of printing; the extension of literacy and education; the translation and reading of the Bible; ',
'the newly realized contrast between the poverty and simplicity of the Apostles and the ceremonious opulence of the Church; the rising wealth and economic independence of Germany and England;',
'the growth of a middle class resentful of ecclesiastical restrictions and claims; the protests against the flow of money to Rome; the secularization of law and government; ',
'the intensification of nationalism and the strengthening of monarchies; the nationalistic influence of vernacular languages and literatures; the fermenting legacies of the Waldenses, Wyclif, and Huss;',
'the mystic demand for a less ritualistic, more personal and inward and direct religion: all these were now uniting in a torrent of forces that would crack the crust of medieval custom, loosen all standards and bonds, shatter Europe into nations and sects, sweep away more and more of the supports and comforts of traditional beliefs, and perhaps mark the beginning of the end for the dominance of Christianity in the mental life of European man.']
If we check the lengths of the end result, we get this many words in each tokenized segment:
[len(sent.split()) for sent in tokenized_paragraph]
[21, 31, 25, 30, 25, 29, 27, 26, 76]
Only the last segment exceeded 30 words (76 words), and that's okay!
Edit
The regex could include a colon : So the last segment would be less than 76

I would suggest using findall instead of split.
Then the regex could be:
(?:\S+\s+)*?(?:\S+\s+){17,29}\S+(?:$|[″;*"(§=!‡…†\?\]‘)¿♥[]+)
Break-down:
\S+\s+ a word and the space(s) that follow it
(?:\S+\s+)*?(?:\S+\s+){17,29}: lazily match some words, each followed by white space (so initially it won't match any), and then greedily match as many words as possible, up to 29 but at least 17, all ending with white space. The first lazy match is needed for when no match completes with just the greedy part.
\S+(?:$|[″;*"(§=!‡…†\?\]‘)¿♥[]+): match one more word, terminated by a terminator character, or the end of the string.
So:
regex = r'(?:\S+\s+)*?(?:\S+\s+){18,30}\S+(?:$|[″;*"(§=!‡…†\?\]‘)¿♥[]+)'
new_text = re.findall(regex, paragraph)
for line in new_text:
    print(len(line.split()), line)
The number of words per segment is:
[21, 31, 25, 30, 25, 29, 27, 26, 76]
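If a single regex becomes hard to tune, the rejoining could also be done with a plain loop. The sketch below is a different technique from the findall pattern: it splits after each terminator and greedily accumulates clauses until adding the next one would exceed the upper bound. The function name, the default bounds, and the reduced terminator set (; and :) are assumptions for illustration.

```python
import re


def split_into_segments(text, min_words=18, max_words=30, terminators=";:"):
    """Greedily group terminator-delimited clauses into word-bounded segments."""
    # Split on the whitespace that follows a terminator, so the terminator
    # stays attached to the clause it ends.
    splitter = "(?<=[" + re.escape(terminators) + r"])\s+"
    clauses = [c.strip() for c in re.split(splitter, text) if c.strip()]

    segments, current, count = [], [], 0
    for clause in clauses:
        n = len(clause.split())
        # Close the running segment when the next clause would overshoot
        # and the segment is already long enough.
        if current and count + n > max_words and count >= min_words:
            segments.append(" ".join(current))
            current, count = [], 0
        current.append(clause)
        count += n
    if current:
        segments.append(" ".join(current))
    return segments


# Tiny example with small bounds so the grouping is easy to follow.
demo = split_into_segments("a b c; d e; f g h i; j k", min_words=3, max_words=5)
print(demo)
```

A clause that is longer than max_words on its own still ends up in a single oversized segment, which mirrors the "that's okay" case in the question.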

REGEX: Remove spaces between strings with one or two letters

Consider the original strings shown in the first column of the following table:
Original String                   Parsed String              Desired String
'W. & J. JOHNSON LMT.COM'         #W J JOHNSON LIMITED       #WJ JOHNSON LIMITED
'NORTH ROOF & WORKS CO. LTD.'     #NORTH ROOF WORKS CO LTD   #NORTH ROOF WORKS CO LTD
'DAVID DOE & CO., LIMITED'        #DAVID DOE CO LIMITED      #DAVID DOE CO LIMITED
'GEORGE TV & APPLIANCE LTD.'      #GEORGE TV APPLIANCE LTD   #GEORGE TV APPLIANCE LTD
'LOVE BROS. & OTHERS LTD.'        #LOVE BROS OTHERS LTD      #LOVE BROS OTHERS LTD
'A. B. & MICHAEL CLEAN CO. LTD.'  #A B MICHAEL CLEAN CO LTD  #AB MICHAEL CLEAN CO LTD
'C.M. & B.B. CLEANER INC.'        #C M B B CLEANER INC       #CMBB CLEANER INC
Punctuation needs to be removed which I have done as follows:
def transform(word):
    word = re.sub(r'(?<=[A-Za-z])\'(?=[A-Za-z])[A-Z]|[^\w\s]|(.com|COM)',' ',word)
    return word
However, there is one last point which I have not been able to solve. After removing punctuation I end up with lots of spaces. How can I write a regular expression that joins initials together and keeps single spaces between regular words (no initials)?
Is substituting the mentioned characters a bad approach to getting the desired strings?
Thanks for allowing me to continue learning :)

I think it's simpler to do this in parts. First, remove .com and any punctuation other than space or &. Then, remove a space or & surrounded by only one letter. Finally, replace any remaining sequence of space or & with a single space:
import re
strings = ['W. & J. JOHNSON LMT.COM',
'NORTH ROOF & WORKS CO. LTD.',
'DAVID DOE & CO., LIMITED',
'GEORGE TV & APPLIANCE LTD.',
'LOVE BROS. & OTHERS LTD.',
'A. B. & MICHAEL CLEAN CO. LTD.',
'C.M. & B.B. CLEANER INC.'
]
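for s in strings:
    # 1. drop .COM and every character that is not a letter, & or space
    s = re.sub(r'\.COM|[^a-zA-Z& ]+', '', s, flags=re.IGNORECASE)
    # 2. remove a space or & that sits between two single letters
    s = re.sub(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', s)
    # 3. collapse any remaining run of spaces and & into one space
    s = re.sub(r'\s*[& ]\s*', ' ', s)
    print(s)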
Output
WJ JOHNSON LMT
NORTH ROOF WORKS CO LTD
DAVID DOE CO LIMITED
GEORGE TV APPLIANCE LTD
LOVE BROS OTHERS LTD
AB MICHAEL CLEAN CO LTD
CM BB CLEANER INC
Update
This was written before the edit to the question changing the required result for the last data. Given the edit, the above code can be simplified to
for s in strings:
    s = re.sub(r'\.COM|[^a-zA-Z ]+|\s(?=&)|(?<!\w\w)\s+(?!\w\w)', '', s, flags=re.IGNORECASE)
    print(s)

Doing this in regex alone won't be pretty and is not the best solution, yet, here it is! You're better off doing a multiple step approach. What I've done is identified all the cases that are possible and opted to find a solution where there's no replacement string since you're not always replacing the characters with spaces.
Rules
Non "Stacked" Abbreviations
These are locations like A. B. or W. & J., but not C.M. & B.B.
I've identified these as locations where an abbreviation part (e.g. A.) exists before and after, but the latter is not followed by another alpha character
Preceding Space
These locations don't exist in your text but could if a space preceded a non-alpha character without a space following it (say at the end of a line)
We match the characters after the first space in these cases
Following Space
These are locations like & and the dot in J.
We match the character before the last space in those examples
No Spaces
These are locations like 'LOVE (the apostrophe in that string)
We only match the non-alpha-non-whitespace characters
Regex
An all-in-one regex that accomplishes this is as follows:
(?<=\b[a-z])[^a-z]+(?=[a-z]\b(?![^a-z][a-z]))|(?<= ) *(?:\.com\b|[^a-z\s]+) *| *(?:\.com\b|[^a-z\s]+) *(?= )|(?<! )(?:\.com\b|[^a-z\s]+)(?! )
Works as follows (broken into each alternation):
(?<=\b[a-z])[^a-z]+(?=[a-z]\b(?![^a-z][a-z])) matches non-alpha characters between A. and B. but not A. and B.B
(?<=\b[a-z]) positive lookbehind ensuring what precedes is an alpha character and assert a word boundary position to its left
[^a-z]+ match any non-alpha character one or more times
(?=[a-z]\b(?![^a-z][a-z])) positive lookahead ensuring the following exists
[a-z]\b match any alpha character and assert a word boundary position to its right
(?![^a-z][a-z]) negative lookahead ensuring what follows is not a non-alpha character followed by an alpha character
(?<= ) *(?:\.com\b|[^a-z\s]+) * ensures a space precedes, then matches any spaces, .com or any non-word-non-whitespace characters one or more times, then any spaces
(?<= ) positive lookbehind ensuring a space precedes
' *' (a space followed by *) match any number of spaces
(?:\.com\b|[^a-z\s]+) match .com followed by a word boundary, or match any non-alpha-non-whitespace character one or more times
' *' (a space followed by *) match any number of spaces
*(?:\.com\b|[^a-z\s]+) *(?= ) matches any spaces, .com or any non-word-non-whitespace characters one or more times, then any spaces, then ensures a space follows
Same as previous but instead of the positive lookbehind at the beginning, there's a positive lookahead at the end
(?<! )(?:\.com\b|[^a-z\s]+)(?! ) matches .com or any non-alpha-non-whitespace characters one or more times ensuring no spaces surround it
Same as previous two options but uses negative lookbehind and negative lookahead
Code
import re
strings = [
"'W. & J. JOHNSON LMT.COM'",
"'NORTH ROOF & WORKS CO. LTD.'",
"'DAVID DOE & CO., LIMITED'",
"'GEORGE TV & APPLIANCE LTD.'",
"'LOVE BROS. & OTHERS LTD.'",
"'A. B. & MICHAEL CLEAN CO. LTD.'",
"'C.M. & B.B. CLEANER INC.'"
]
r = re.compile(r'(?<=\b[a-z])[^a-z]+(?=[a-z]\b(?![^a-z][a-z]))|(?<= ) *(?:\.com\b|[^a-z\s]+) *| *(?:\.com\b|[^a-z\s]+) *(?= )|(?<! )(?:\.com\b|[^a-z\s]+)(?! )', re.IGNORECASE)
def transform(word):
    return r.sub('', word)

for s in strings:
    print(transform(s))
Outputs:
WJ JOHNSON LMT
NORTH ROOF WORKS CO LTD
DAVID DOE CO LIMITED
GEORGE TV APPLIANCE LTD
LOVE BROS OTHERS LTD
AB MICHAEL CLEAN CO LTD
CM BB CLEANER INC
Edit
Using a callback, you can extend this logic to include special cases as mentioned in a comment below my answer to match specific cases and have conditional replacements.
These special cases include:
FONTAINE'S to FONTAINE
PREMIUM-FIT AUTO to PREMIUM FIT AUTO
62325 W.C. to 62325 WC
I added a new alternation to the regex: (\b[\'-]\b(?:[a-z\d] )?) to capture 'S or - between letters (also -S or similar) and replace it with a space using the callback (if the capture group exists).
I still suggest using multiple regular expressions to accomplish this, but I wanted to show that it is possible with a single pattern.
import re
strings = [
"'W. & J. JOHNSON LMT.COM'",
"'NORTH ROOF & WORKS CO. LTD.'",
"'DAVID DOE & CO., LIMITED'",
"'GEORGE TV & APPLIANCE LTD.'",
"'LOVE BROS. & OTHERS LTD.'",
"'A. B. & MICHAEL CLEAN CO. LTD.'",
"'C.M. & B.B. CLEANER INC.'",
"'FONTAINE'S PREMIUM-FIT AUTO 62325 W.C.'"
]
r = re.compile(r'(?<=\b[a-z\d])[^a-z\d]+(?=[a-z\d]\b(?![^a-z\d][a-z\d]))|(?<= ) *(?:\.com\b|[^a-z\d\s]+) *| *(?:\.com\b|[^a-z\d\s]+) *(?= )|(\b[\'-]\b(?:[a-z\d] )?)|(?<! )(?:\.com\b|[^a-z\d\s]+)(?! )', re.IGNORECASE)
def repl(m):
    # group 1 only exists for the 'S / hyphen special cases, which become a space
    return ' ' if m.group(1) else ''

for s in strings:
    print(r.sub(repl, s))

Here's the simplest I could get it with one regex pattern:
\.COM|(?<![A-Z]{2}) (?![A-Z]{2})|[.&,]| (?>)&
Basically, it removes characters that fit any of these 4 criteria:
Literal ".COM"
Spaces that are not preceded or followed by 2 capital letters
Dots, ampersands, and commas, regardless of where they appear
Spaces followed by ampersands
Demo: https://regex101.com/r/EMHxq9/2

regex capture text in brackets, omitting optional prefix

I'm trying to convert some documents (Wikipedia articles) which contain links with a specific markdown convention. I want to render these to be reader-friendly without links. The convention is:
1. Names in double brackets of the pattern [[Article Name|Display Name]] should be captured ignoring the pipe and preceding text as well as the enclosing brackets: Display Name.
2. Names in double brackets of the pattern [[Article Name]] should be captured without the brackets: Article Name.
Nested approach (produces desired result)
I know I can handle #1 and #2 in a nested re.sub() expression. For example, this does what I want:
s = 'including the [[Royal Danish Academy of Sciences and Letters|Danish Academy of Sciences]], [[Norwegian Academy of Science and Letters|Norwegian Academy of Sciences]], [[Russian Academy of Sciences]], and [[National Academy of Sciences|US National Academy of Sciences]].'
re.sub(r'\[\[(.*?\|)(.*?)\]\]', r'\2',  # case 1
       re.sub(r'\[\[([^|]+)\]\]', r'\1', s)  # case 2
)
# result is correct:
'including the Danish Academy of Sciences, Norwegian Academy of Sciences, Russian Academy of Sciences, and US National Academy of Sciences.'
Single-pass approach (looking for solution here)
For efficiency and my own improvement, I would like to know whether there is a single-pass approach.
What I have tried: In an optional group 1, I want to greedy-capture everything between [[ and a | (if it exists). Then in group 2, I want to capture everything else up to the ]]. Then I want to return only group 2.
My problem is in making the greedy capture optional:
re.sub('\[\[([^|]*\|)?(.*?)\]\]','\\2',s)
# does NOT return the desired result:
'including the Danish Academy of Sciences, Norwegian Academy of Sciences, US National Academy of Sciences.'
# is missing: 'Russian Academy of Sciences, and '

\[{2}(?:(?:(?!]{2})[^|])+\|)*((?:(?!]{2})[^|])+)]{2}
\[{2} Match [[
(?:(?:(?!]{2})[^|])+\|)* Matches the following any number of times
(?:(?!]{2})[^|])+ Tempered greedy token matching any character one or more times except | or location that matches ]]
\| Matches | literally
((?:(?!]{2})[^|])+) Capture the following into capture group 1
(?:(?!]{2})[^|])+ Tempered greedy token matching any character one or more times except | or location that matches ]]
]{2} Match ]]
Replacement \1
Result:
including the Danish Academy of Sciences, Norwegian Academy of Sciences, Russian Academy of Sciences, and US National Academy of Sciences.
Another alternative that may work for you is the following. It's less specific than the regex above but doesn't include any lookarounds.
\[{2}(?:[^]|]+\|)*([^]|]+)]{2}
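For reference, here is a short Python sketch of the lookaround-free variant applied in a single re.sub pass to the string from the question. The variable names are illustrative; the pattern is the one given just above.

```python
import re

s = ('including the [[Royal Danish Academy of Sciences and Letters|Danish Academy of Sciences]], '
     '[[Norwegian Academy of Science and Letters|Norwegian Academy of Sciences]], '
     '[[Russian Academy of Sciences]], and '
     '[[National Academy of Sciences|US National Academy of Sciences]].')

# Zero or more "prefix|" parts are consumed without capturing; group 1 holds
# the last part before the closing ]], which is the text to display.
link = re.compile(r'\[{2}(?:[^]|]+\|)*([^]|]+)]{2}')

result = link.sub(r'\1', s)
print(result)
```

Because neither character class can cross a ] or a |, a match can never run from one link into the next, which is what went wrong with the optional-group attempt in the question.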

Regex - Matching repeating pattern

I have the following string which contains a repeating pattern of text followed by parentheses with an ID number.
The New York Yankees (12980261666)\n\nRedsox (1901659429)\nMets (NYC)
(21135721896)\nKansas City Royals (they are 7-1) (222497247812331)\n\n
other team (618006)\n
I'm struggling to write a regex that would return:
The New York Yankees (12980261666)
Redsox (1901659429)
Mets (NYC) (21135721896)
Kansas City Royals (they are 7-1) (222497247812331)
other team (618006)
The newline character could be replaced later with a string.replace('\n', '').

use the negate character to achieve this.
String pat="([^\\n])"

Extract Alberta (Canada) postal code through regular expressions in python

I want to extract postal codes of Alberta (Canada) region from an address string.
For example:
addr = '12345-67 Ave, Edmonton, AB T1A 2B3, Canada'
Should extract T1A 2B3.
The regular expression to match the postal code is [T]\d[A-Z] *\d[A-Z]\d. However, given an entire address, I do not know how to extract only the postal code. I guess it has something to do with backreferences () but I cannot figure it out.
How can I achieve this in Python?

Extracting just the substring that matched the regexp is easy enough:
import re
test = re.compile(r'[T]\d[A-Z] *\d[A-Z]\d')
addr = '12345-67 Ave, Edmonton, AB T1A 2B3, Canada'
test.search(addr).group()
test.search will return a match object, which has all kinds of stuff you can extract.
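One detail worth handling: search returns None when nothing matches, so calling .group() directly raises an AttributeError for addresses without a postal code. A small sketch, where the helper name extract_postal_code is hypothetical:

```python
import re

POSTAL_CODE = re.compile(r'T\d[A-Z] *\d[A-Z]\d')


def extract_postal_code(addr):
    """Return the first Alberta-style postal code in addr, or None."""
    match = POSTAL_CODE.search(addr)
    return match.group() if match else None


print(extract_postal_code('12345-67 Ave, Edmonton, AB T1A 2B3, Canada'))  # T1A 2B3
print(extract_postal_code('no postal code here'))  # None
```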

Building on @Peter's answer, here is how you can do it for some more postal codes:
US:
addr= 'Statue of liberty, New York, NY 10004, USA'
test = re.compile(r'\d{5}')
test.search(addr).group()
UK:
addr= 'Olympic Park, Montfichet Rd, London E20 1EJ, United Kingdom'
test = re.compile(r'[A-Z]\d\d\s\d[A-Z]{2}')
Canada:
addr= 'Toronto City Hall, 100 Queen St W, Toronto, ON M5H 2N2'
test = re.compile(r'[A-Z]\d[A-Z]\s\d[A-Z]\d')
[A-Z] Matches any uppercase letter in range A-Z
[a-zA-Z] Matches any letter, uppercase or lowercase
\d matches any digit
\d{n} matches exactly n consecutive digits
\s matches any whitespace character
You can also use Regex101, which is a very helpful tool for testing Regexes.
