I'm trying to use RegEx Tokenize to split the following string data:
"• General The general responsibilities of the role are XYZ • Physical Demands The Physical Demands of this role are XYZ • Education Requirements The education requirements for this role are • Bachelor's Degree • Appropriate Certification • Experience 5 years of experience is required"
I want to reach this as the final stage:
A header
• General The general responsibilities of the role are XYZ
• Physical Demands The Physical Demands of this role are XYZ
• Education Requirements The education requirements for this role are • Bachelor's Degree • Appropriate Certification
• Experience 5 years of experience is required"
I've had success with grouping it, and parsing it, but it's not as dynamic as I'd like.
There is a pattern I want to split by: • words multiple spaces i.e. •.*?\s{3,}
NOTE: one of the categories uses bullet points within it (Education Requirements). This is the part that I find most problematic.
Any help would be greatly appreciated! Perhaps RegEx Tokenize isn't the most dynamic either.
You might use:
•\s+[^\s•].*?\s{3,}.*?(?=•[^•\n]*?\s{3}|$)
Explanation
•\s+ Match • and 1+ whitespace chars
[^\s•].*? Match a non whitespace char other than • and then match any character, as few as possible
\s{3,} Match 3 or more whitespace chars
.*? Match any character, as few as possible
(?= Positive lookahead, assert that to the right is
•[^•\n]*?\s{3} Match •, then as few as possible chars other than • or a newline followed by 3 whitespace chars
| Or
$ End of string
) Close the lookahead
Python demo:
import re
s = "• General The general responsibilities of the role are XYZ • Physical Demands The Physical Demands of this role are XYZ • Education Requirements The education requirements for this role are • Bachelor's Degree • Appropriate Certification • Experience 5 years of experience is required"
pattern = r"•\s+[^\s•].*?\s{3,}.*?(?=•[^•\n]*?\s{3}|$)"
result = re.findall(pattern, s)
print(result)
Output
[
'• General The general responsibilities of the role are XYZ ',
'• Physical Demands The Physical Demands of this role are XYZ ',
"• Education Requirements The education requirements for this role are • Bachelor's Degree • Appropriate Certification ",
'• Experience 5 years of experience is required'
]
Note that using \s can also match a newline. If you don't want to match newlines, you can use [^\S\n] instead.
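The sample string as displayed here has its runs of spaces collapsed, so \s{3,} has nothing to match in it. Below is a minimal sketch that assumes each header is followed by exactly three spaces; that spacing is an assumption reconstructed from the question's description of the pattern, not the asker's actual data.

```python
import re

# Assumption: three spaces separate each header from its text.
s = (
    "• General   The general responsibilities of the role are XYZ "
    "• Physical Demands   The Physical Demands of this role are XYZ "
    "• Education Requirements   The education requirements for this role are "
    "• Bachelor's Degree • Appropriate Certification "
    "• Experience   5 years of experience is required"
)

pattern = r"•\s+[^\s•].*?\s{3,}.*?(?=•[^•\n]*?\s{3}|$)"

# Each match ends just before the next header bullet; strip the trailing space.
sections = [m.strip() for m in re.findall(pattern, s)]

for section in sections:
    print(section)
```

The nested bullets stay inside the Education Requirements section because the lookahead only stops at a bullet whose label is followed by three whitespace characters.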
I'm trying to come up with a proper regex pattern (and I am very bad at it) for the strings that I have. Each time I end up with something that only works partly. I'll show the pattern that I made later below, but first, I want to specify what I want to extract out of a text.
Data:
Company Fragile9 Closes €9M Series B Funding
Appplle21 Receives CAD$17.5K in Equity Financing
Cat Raises $10.8 Millions in Series A Funding
Sun Raises EUR35M in Funding at a $1 Billion Valuation
Japan1337 Announces JPY 1.78 Billion Funding Round
From that data I only need to extract the amount of money a company receives (including $/€ etc., and the currency code if it's there, such as CAD for Canadian dollars).
So, as a result, I expect to receive this:
€9M
CAD$17.5K
$10.8 Millions
EUR35M
JPY 1.78 Billion
The pattern that I use (throw rotten tomatoes at me):
try:
    pattern = '(\bAU|\bUSD|\bUS|\bCHF)*\s*[\$\€\£\¥\₣\₹\?]\s*\d*\.?\d*\s*(K|M)*[(B|M)illion]*'
    raises = re.search(pattern, text, re.IGNORECASE) # text – a row of data mentioned above
    raises = raises.group().upper().strip()
    print(raises)
except:
    raises = '???'
    print(raises)
Also, a pattern that works in an online Python regex editor sometimes does not work in the actual script.
Some issues in your regex:
The list of currency acronyms (AU USD US CHF) is too limited. It will not match JPY, nor many other acronyms. Maybe allow any word of 2-3 capitals.
Not a problem, but there is no need to escape the currency symbols with a backslash.
The \? in the currency list is not a currency symbol.
The regex requires both a currency acronym and a currency symbol. Maybe you intended to make the currency symbol optional with \? but then the ? should appear unescaped after the character class, and there should still be a possibility to not have the acronym and only the symbol.
The regex requires that the number has decimals. This should be made optional.
(K|M)* will allow KKKKKKK. You don't want a * here.
[(B|M)illion]* will allow the letters BMilon, a literal pipe and literal parentheses to occur in any order and any number. Like it will match "in" and "non" and "(BooM)"
The previous two mentioned patterns are put in sequence, while they should be mutually exclusive.
The regex does not provide for matching the final "s" in "millions".
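Two of the points above are easy to check directly. A small sketch (the test words are just illustrations, not from the question's data):

```python
import re

# (K|M)* repeats the group, so a run of K's is accepted.
repeated_suffix = re.fullmatch(r"(K|M)*", "KKKKKKK")

# [(B|M)illion]* is a character class: any mix of ( B | M ) i l o n.
loose_class = [
    word
    for word in ["in", "non", "(BooM)", "Million"]
    if re.fullmatch(r"[(B|M)illion]*", word)
]

print(bool(repeated_suffix))
print(loose_class)
```

Every test word passes the character class, which shows it does not constrain the order of the letters at all.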
Here is a correction:
(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?(?:\s*(?:K|[BM](?:illions?)?)\b)?
In Python syntax:
pattern = r"(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?(?:\s*(?:K|[BM](?:illions?)?)\b)?"
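As a quick check, the corrected pattern can be run against the five sample headlines from the question. This is a minimal sketch; the variable names are illustrative, and re.IGNORECASE is deliberately left out because [A-Z]{2,3} relies on case to recognise currency codes.

```python
import re

pattern = r"(?:\b[A-Z]{2,3}\s*[$€£¥₣₹]?|[$€£¥₣₹])\s*\d+(?:\.\d+)?(?:\s*(?:K|[BM](?:illions?)?)\b)?"

headlines = [
    "Company Fragile9 Closes €9M Series B Funding",
    "Appplle21 Receives CAD$17.5K in Equity Financing",
    "Cat Raises $10.8 Millions in Series A Funding",
    "Sun Raises EUR35M in Funding at a $1 Billion Valuation",
    "Japan1337 Announces JPY 1.78 Billion Funding Round",
]

amounts = []
for line in headlines:
    # re.search returns the first (leftmost) amount in the line, or None.
    match = re.search(pattern, line)
    amounts.append(match.group() if match else None)

print(amounts)
```

Note that re.search only returns the first amount, so the "$1 Billion" valuation in the fourth headline is skipped; re.findall would return both.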
I'm usually pretty good with Regex but I'm struggling with this one. I need a regular expression that matches the term cbd but not if the phrase central business district appears anywhere else in the search string. Or if that is too difficult, at least matches cbd if the phrase central business district doesn't appear anywhere before the term cbd. Only the cbd part should be returned as the result, so I'm using lookaheads/lookbehinds, but I have not been able to meet the requirements...
Input examples:
GOOD
Any products containing CBD are to be regulated.
BAD
Properties located within the Central Business District (CBD) are to be regulated
I have tried:
(?!central business district)cbd
(.*(?!central business district).*)cbd
This is in Python 3.6+ using the re module.
I know it would be easy to accomplish with a couple lines of code, but we have a list of regex strings in a database that we are using to search a corpus for documents that contain any one of the regex strings from the DB. It is best to avoid hard-coding any keywords into the scripts because then it would not be clear to our other developers where these matches are coming from because they can't see it in the database.
Use the PyPI regex module, which supports variable-length lookbehinds:
import regex
strings = [' I need a regular expression that matches the term cbd but not if the phrase central business district appears anywhere else in the search string.', 'I need cbd here.']
for s in strings:
    x = regex.search(r'(?<!central business district.*)cbd(?!.*central business district)', s, regex.S)
    if x:
        print(s, x.group(), sep=" => ")
Results: I need cbd here. => cbd
Explanation
--------------------------------------------------------------------------------
(?<! look behind to see if there is not:
--------------------------------------------------------------------------------
central business 'central business district'
district
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
) end of look-behind
--------------------------------------------------------------------------------
cbd 'cbd'
--------------------------------------------------------------------------------
(?! look ahead to see if there is not:
--------------------------------------------------------------------------------
.* any character except \n (0 or more times
(matching the most amount possible))
--------------------------------------------------------------------------------
central business 'central business district'
district
--------------------------------------------------------------------------------
) end of look-ahead
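If installing a third-party module is not an option, a different technique works with the standard re module: reject the whole string with a lookahead anchored at the start, then capture cbd in a group. This is only a sketch and a trade-off, since the overall match spans more than cbd and the result has to be read from group 1. The re.I flag is an assumption based on the mixed-case examples in the question.

```python
import re

# The negative lookahead at \A scans the entire string for the phrase,
# so the phrase is rejected whether it occurs before or after cbd.
pattern = re.compile(r"\A(?!.*central business district).*?(cbd)", re.I | re.S)

good = "Any products containing CBD are to be regulated."
bad = "Properties located within the Central Business District (CBD) are to be regulated"

good_match = pattern.search(good)
bad_match = pattern.search(bad)

print(good_match.group(1) if good_match else None)
print(bad_match)
```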
I am trying to clean up text for use in a machine learning application. Basically these are specification documents that are "semi-structured" and I am trying to remove the section number that is messing with NLTK sent_tokenize() function.
Here is a sample of the text I am working with:
and a Contract for the work and/or material is entered into with some other person for a
greater amount, the undersigned hereby agrees to forfeit all right and title to the
aforementioned deposit, and the same is forfeited to the Crown.
2.3.3
...
(b)
until thirty-five days after the time fixed for receiving this tender,
whichever first occurs.
2.4
AGREEMENT
Should this tender be accepted, the undersigned agrees to enter into written agreement with
the Minister of Transportation of the Province of Alberta for the faithful performance of the
works covered by this tender, in accordance with the said plans and specifications and
complete the said work on or before October 15, 2019.
I am trying to remove all the section breaks (ex. 2.3.3, 2.4, (b)), but not the date numbers.
Here is the regex I have so far: [0-9]*\.[0-9]|[0-9]\.
Unfortunately it matches part of the date in the last paragraph (2019. turns into 201) and I really don't know how to fix this, being a non-expert at regex.
Thanks for any help!
You may try replacing the following pattern with an empty string:
((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))
output = re.sub(r'((?<=^)|(?<=\n))(?:\d+(?:\.\d+)*|\([a-z]+\))', '', input)
print(output)
This pattern works by matching a section number as \d+(?:\.\d+)*, but only if it appears as the start of a line. It also matches letter section headers as \([a-z]+\).
For your specific case, I think \n[\d+\.]+|\n\(\w\) should work. The \n helps to differentiate the section numbers.
The pattern you tried, [0-9]*\.[0-9]|[0-9]\., is not anchored. It matches either 0+ digits, a dot and a single digit, or a single digit and a dot.
It also does not take the letters between parentheses into account.
Assuming that the section breaks are at the start of a line and might be preceded by spaces or tabs, you could update your pattern with the alternation to:
^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))
^ Start of string (start of each line when re.MULTILINE is used)
[\t ]* Match 0+ times a space or tab
(?: Non capturing group
\d+(?:\.\d+)+ Match 1+ digits, then repeat 1+ times a dot and 1+ digits, so that at least one dot is required (matches 2.3.3 or 2.4)
|
\([a-z]+\) Match 1+ times a-z between parentheses
) Close non capturing group
For example, using re.MULTILINE, where s is your string:
pattern = r"^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))"
result = re.sub(pattern, "", s, flags=re.MULTILINE)
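A self-contained sketch of that substitution, run on a shortened version of the sample text from the question. Dropping the lines left empty after the substitution is an extra step added here for illustration.

```python
import re

text = (
    "aforementioned deposit, and the same is forfeited to the Crown.\n"
    "2.3.3\n"
    "(b)\n"
    "until thirty-five days after the time fixed for receiving this tender,\n"
    "whichever first occurs.\n"
    "2.4\n"
    "AGREEMENT\n"
    "complete the said work on or before October 15, 2019.\n"
)

pattern = r"^[\t ]*(?:\d+(?:\.\d+)+|\([a-z]+\))"

# With re.MULTILINE, ^ anchors at the start of every line, so only
# line-initial section markers are removed; "2019." is left alone.
stripped = re.sub(pattern, "", text, flags=re.MULTILINE)

# Extra step: drop the lines that are now empty.
cleaned = "\n".join(line for line in stripped.splitlines() if line.strip())

print(cleaned)
```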
I am trying to parse text from a document using regex. The document contains different section styles, e.g. section 1.2 and section (1). The regex below is able to parse text whose section number has a decimal point, but fails for ().
Any suggestions on how to handle content which starts with ()?
For example:
import re
RAW_Data = '(4) The Governor-General may arrange\n with the Chief Minister of the Australian Capital Territory for the variation or revocation of an \n\narrangement in force under subsection (3). \nNorthern Territory \n (5) The Governor-General may make arrangements with the \nAdministrator of the Northern \nTerritory with respect to the'
f = re.findall(r'(^\d+\.[\d\.]*)(.*?)(?=^\d+\.[\d\.]*)', RAW_Data,re.DOTALL|re.M|re.S)
for z in f:
    z=(''.join(z).strip().replace('\n',''))
    print(z)
Expected output:
(4) The Governor-General may arrange with the Chief Minister of the Australian Capital Territory for the variation or revocation of an arrangement in force under subsection
(3) Northern Territory
(5) The Governor-General may make arrangements with the Administrator of the Northern Territory with respect to the'
Use the regex [sS]ection\s*\(?\d+(?:\.\d+)?\)?
The \(?\d+(?:\.\d+)?\)? part will match any number, with or without a decimal part and with or without parentheses.
You can try:
(?<=(\(\d\)|\d\.\d))(.(?!\(\d\)|\d\.\d))*
To understand how it works, consider the following block:
(\(\d\)|\d\.\d)
It looks for strings which are of type (X) or X.Y, where X and Y are numbers. Let's call such string 'delimiters'.
Now, the regex above looks for the first character preceded by a delimiter (positive lookbehind) and matches the following characters until it finds one which is followed by a delimiter (negative lookahead).
Hope it helps!
Here is another regex: \(\d\)[^(]+
\(\d\) matches any string like (1) (2) (3) ...
[^(]+ matches one or more characters and stops matching when it finds (
But I wonder whether you have special cases like (4) The Governor-General may arrange\n with the Chief Minister of the Austr ... (2) (3). \nNorthern Territory \n, where a sentence starting at (4) contains references such as (2), because my regex cannot match this type of sentence.
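One way around the inline-reference problem is to treat a marker as a delimiter only when it starts a line. The sketch below makes that assumption, so it differs from the question's expected output: the reference "(3)" inside block (4) is not treated as a new section. The names raw_data and blocks are illustrative.

```python
import re

raw_data = '(4) The Governor-General may arrange\n with the Chief Minister of the Australian Capital Territory for the variation or revocation of an \n\narrangement in force under subsection (3). \nNorthern Territory \n (5) The Governor-General may make arrangements with the \nAdministrator of the Northern \nTerritory with respect to the'

# A block starts at a line-initial "(n)" (optionally indented) and runs
# until the next line-initial "(n)" or the end of the text.
pattern = r'^\s*(\(\d+\).*?)(?=^\s*\(\d+\)|\Z)'

blocks = [
    " ".join(block.split())  # collapse newlines and repeated spaces
    for block in re.findall(pattern, raw_data, re.M | re.S)
]

for block in blocks:
    print(block)
```

Extending the marker to also accept decimal section numbers such as 1.2 would mean adding an alternative like \d+(?:\.\d+)+ in both places where \(\d+\) appears.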
I'm sure I'm just missing something, but my regex is a little rusty.
I have a well formatted text corpus and it came out of a SQLite DB that had each review as a row, which is fine and I wrote it out that way to a text file, so each review is a line followed by a new line character.
What I need to do is convert every sentence into a line to feed an iterator that expects sentences as lines that then feeds a model. The text is all professionally written and edited, so a simple regex that splits lines based on strings ending in [.!?] or [.!?] followed by a double quotation mark (") is actually sufficient. something like
re.split('(?<=[.!?]) +|((?<=[.!?])\")', text)
The lookbehind works for anything except ("). I've usually done regex mostly in R or Ruby and this is just making me feel dumb in the wee hours of Sunday night.
Example text:
“Trip-hop” eventually became a ’90s punchline, a music-press shorthand for “overhyped hotel lounge music.” But today, the much-maligned subgenre almost feels like a secret precedent. Listen to any of the canonical Bristol-scene albums of the mid-late ’90s, when the genre was starting to chafe against its boundaries, and you’d think the claustrophobic, anxious 21st century started a few years ahead of schedule.
Thanks in advance for any suggestions.
You may use
r'(?:(?<=[.!?])|(?<=[.!?]["”]))\s+'
Details
(?: - start of a non-capturing alternation group matching:
(?<=[.!?]) - a position that is immediately preceded with ., ! or ?
| - or
(?<=[.!?]["”]) - a position that is immediately preceded with ., ! or ? followed with " or ”
) - end of the grouping
\s+ - 1+ whitespaces.
Python 3 demo:
import re
rx = r'(?:(?<=[.!?])|(?<=[.!?]["”]))\s+'
s = "“Trip-hop” eventually became a ’90s punchline, a music-press shorthand for “overhyped hotel lounge music.” But today, the much-maligned subgenre almost feels like a secret precedent. Listen to any of the canonical Bristol-scene albums of the mid-late ’90s, when the genre was starting to chafe against its boundaries, and you’d think the claustrophobic, anxious 21st century started a few years ahead of schedule."
for result in re.split(rx, s):
    print(result)
Output:
“Trip-hop” eventually became a ’90s punchline, a music-press shorthand for “overhyped hotel lounge music.”
But today, the much-maligned subgenre almost feels like a secret precedent.
Listen to any of the canonical Bristol-scene albums of the mid-late ’90s, when the genre was starting to chafe against its boundaries, and you’d think the claustrophobic, anxious 21st century started a few years ahead of schedule.