I have a text scattered with various strings, dates, tab characters and language codes. I want to extract the strings that follow a date+tab combination, and which are followed by a language code like '[en]', a tab character, and after which we don't have the string "BAD THINGS" (e.g. "2020-01-12\tSTRING WE NEED[en]\tGOOD THINGS", as opposed to "2020-01-12\tSTRING WE DON'T NEED[en]\tBAD THINGS").
Here is a short example text of what I'm working with:
\n2021-01-12\tThis string is not needed [it]\tBad things\tBad things\n2021-01-12\tThis string is also not needed [en]\tBad things\tBad things\n2021-01-11\tString 1 that is needed! [it]\tString 1 that is needed! is repeated here\tNot interesting here\n2021-01-11\tString 2 that is needed [fr]\tString 2 that is needed is repeated here\tUnnecessary string\n2021-01-11\tString 3 that is needed... [ru]\tString 3 that is needed... is repeated here\tAnother part we're not interested in
I made this regex to capture all strings between a date and a language code:
(\d{4}-\d{2}-\d{2}\\t)(.*?)(\[\w{2}\]\\t)
This works fine (see here). However, when I add a negative lookahead to exclude those followed by "Bad things", all my regex goes south:
(\d{4}-\d{2}-\d{2}\\t)(.*?)(\[\w{2}\]\\t)(?!Bad things)
You can see the result here. I understand my lookahead somehow makes the regex greedy, but I have no idea how to avoid this, adding a ? after it doesn't work. Can you help me out here?
Not sure if this will cover all the cases but this seems to work:
(\d{4}-\d{2}-\d{2}\\t)([^][]*)(\[\w{2}\]\\t)(?!Bad things)
Demo here.
Explanation:
(\d{4}-\d{2}-\d{2}\\t) date and tab
([^][]*) collect only things that do not contain chars `[` and `]`
(\[\w{2}\]\\t) follow up [<tag>]
(?!Bad things) Negative Lookahead
Related
Sorry for the somewhat unhelpful title, I'm having a really hard time explaining this issue.
I have a list of unique identifiers that can appear in a number of different ways and I'm trying to use regex to normalize them so I can compare across several databases. Here are some examples of them:
AB1201
AB-1201
AB1201-T
AB-12-01L1
AB1201-TER
AB1201 Transit
I've written a line of code that pulls out all hypens and spaces, and the used this regex:
([a-zA-Z]{2}[\d]{4})(L\d|Transit|T$)?
This works exactly as expected, returning a list looking like this:
AB1201
AB1201
AB1201T
AB1201L1
AB1201
AB1201T
The issue is, I have one identifier that looks like this: AB1201-02. I need this to be raised as an exception, and not included as a match.
Any ideas? I'm happy to provide more clarification if necessary. Thanks!
From Regex101 online tester
You can exclude matching the following hyphen and a digit (?!-\d) using a negative lookahead.
If it should start at the beginning of the string, you could use an anchor ^
Note that you could write [\d] as \d
^([a-zA-Z]{2}\d{4})(?!-\d)(L\d|Transit|T$)?
The pattern will look like
^ Start of string
( Capture group 1
[a-zA-Z]{2}\d{4} Match 2 times a-zA-Z and 4 digits
) Close group
(?!-\d) Negative lookahead, assert what is directly to the right is not - and a digit
(L\d|Transit|T$)? Optional capture group 2
Regex demo
Try this regular expression
^([a-zA-Z]{2}[\d]{4})(?!-\d)(L\d|Transit|T|-[A-Z]{3})?$
I have added the (?!...) Negative Lookahead to avoid matching with the -02.
(?!...) Negative Lookahead: Starting at the current position in the expression, ensures that the given pattern will not match. Does not consume characters.
You can view a demo on this link.
I am trying to split text of clinical trials into a list of fields. Here is an example doc: https://obazuretest.blob.core.windows.net/stackoverflowquestion/NCT00000113.txt. Desired output is of the form: [[Date:<date>],[URL:<url>],[Org Study ID:<id>],...,[Keywords:<keywords>]]
I am using re.split(r"\n\n[^\s]", text) to split at paragraphs that start with a character other than space (to avoid splitting at the indented paragraphs within a field). This is all good, except the resulting fields are all (except the first field) missing their first character. Unfortunately, it is not possible to use string.partition with a regex.
I can add back the first characters by finding them using re.findall(r"\n\n[^\s]", text), but this requires a second iteration through the entire text (and seems clunky).
I am thinking it makes sense to use re.findall with some regex that matches all fields, but I am getting stuck. re.findall(r"[^\s].+\n\n") only matches the single line fields.
I'm not so experienced with regular expressions, so I apologize if the answer to this question is easily found elsewhere. Thanks for the help!
You may use a positive lookahead instead of a negated character class:
re.split(r"\n\n(?=\S)", text)
Now, it will only match 2 newlines if they are followed with a non-whitespace char.
Also, if there may be 2 or more newlines, you'd better use a {2,} limiting quantifier:
re.split(r"\n{2,}(?=\S)", text)
See the Python demo and a regex demo.
You want a lookahead. You also might want it to be more flexible as far as how many newlines / what newline characters. You might try this:
import re
r = re.compile(r"""(\r\n|\r|\n)+(?=\S)""")
l = r.split(text)
though this does seem to insert \r\n characters into the list... Hmm.
I've looked at this answer and this answer to try to figure out my problem, but I'm not sure they're directly applicable because a) I don't have a condition that always has to be met, and b) the document is so messy that allowing for any of the three to match would result in a large amount of false positives.
So, with that being said, here is my issue. I have lines of text that I want to match that look like this:
x = "10/04 Some brief description blah blah blah 45.00"
where the spacing between everything is messy. Then, I have some lines of text that I want to match that look like this:
y = "VJ../VI Another stupid brief description 1000.00"
z = "11/13 This is another description LO05.13"
The regular expression I'm currently using is this:
regex = r"^(\d\d\s?[1/]\s?\d\d\s?[1/]\d\d)\s+(\S+(?:\s+\S+)*?)\s+(-?\s?[\d,]+\.\d\d)"
The problem is that in y regex doesn't match because there is no date at the beginning of the string; the OCR process messed up. However, we still know that it's a valid line because it has a description and an amount. regex won't match z either because the amount is not a bunch of digits, but we know it's a transaction because there's a date and a description.
I've considered changing the regex to look like this:
regex = r"^(\d\d\s?[1/]\s?\d\d\s?[1/]\d\d\s+)?(\S+(?:\s+\S+)*?)\s+(-?\s?[\d,]+\.\d\d)?"
But I'm worried that that will just match everything in the document (i.e. "Withdrawals and Debits"). And since the two optional pieces of the line of text are on opposite ends of the more consistent piece of the text, I'm not sure how to implement | like in the solutions to the questions I linked.
Is my best option to just make two different regular expressions, linked with |, like so?
regex = r"^(\d\d\s?[1/]\s?\d\d\s?[1/]\d\d\s+)?(\S+(?:\s+\S+)*?)\s+(-?\s?[\d,]+\.\d\d)|^(\d\d\s?[1/]\s?\d\d\s?[1/]\d\d)\s+(\S+(?:\s+\S+)*?)\s+(-?\s?[\d,]+\.\d\d)?"
Any assistance would be appreciated. Thanks
With OCR inputs, it is hard to work out a 100% safe approach. Without the actual output to look at, we can only suggest a general idea on how to deal with each concrete case.
Here, I suggest
r'^(\w+[^\s/]*/\w{2}\b.*?)\s*(\d+\.\d{2})$'
See the regex demo
The pattern is rather a general one:
^ - start of string/line
(\w+[^\s/]*/\w{2}\b.*?) - 1+ alphanumeric symbols or underscore (perhaps, \w+ could be replaced with \w) followed with 0+ non-whitespace and non-/ characters followed with /, then followed with exactly 2 "word" characters followed with a word boundary \b and then as few as possible 0+ characters other than a newline
\s* - 0+ whitespace
(\d+\.\d{2}) - the final float number that can have 1+ digits in the integer part and 2 in the decimal part
$ - end of string/line
Playing around with the limiting quantifier and character classes, you can further fine tune the pattern.
I think the solution suggested in the title is to break the things you are looking for into a series of more focused regex's and then see how many of them you meet.
For example I made:
regex = r"\d\d/\d\d"
regex_2 = r".*\s[\d]+\.\d\d"
Then did:
for i in [x,y,z]:
tests = [re.match(regex, i), re.match(regex_2, i)]
print sum([1 if j else 0 for j in tests])
And got:
2
1
1
I'd need more info before writing the third regex for the description, but I think this is the way forward.
In the following text, I try to match a number followed by ")" and number followed by a period. I am trying to retrieve the text between the matches.
Example:
"1) there is a dsfsdfsd and 2) there is another one and 3) yet another
case"
so I am trying to output: ["there is a dsfsdfsd and", "there is another one and", yet another case"]
I've used this regex: (?:\d)|\d.)
Adding a .* at the end matches the entire string, I only want it to match the words between
also in this string:
"we will give 4. there needs to be another option and 6.99 USD is a
bit amount"
I want to only match the 4. and not the 6.99
Any pointers will be appreciated. Thank you. r
tldr
Regular expressions are tricky beasts and you should avoid them if at all possible.
If you can't avoid them, then make sure you have lots of test cases for all the edge cases that can occur.
Build up your regular expression slowly and systematically, testing your assumptions at every step.
If this code will go intro production, then please write unit tests that explain the thinking process to the poor soul who has to maintain it one day
The long version
Regular expressions are finicky. Your best approach may be to solve the problem a different way.
For example, your language might have a library function that allows you to split up strings using a regular expression to define what comes between the numbers. That will let you get away with writing a simpler regex to match the numbers and brackets/dots.
If you still decide to use regular expressions, then you need to be very structured about how you build up your regular expressions. It's extremely easy to miss edge cases.
So let's break this down piece by piece...
Set up a test environment for quickly experimenting with your regex.
There are lots of options here, depending on your programming language and OS. Ones I sometimes use are:
a Powershell window for testing .Net regexes (NB: the cli gives you a history of past attempts, so you can go back a few steps if you mess things up too badly)
a Python console for testing Python regexes (which are slightly different to .Net regexes in their syntax for named capture groups).
an html page with JavaScript to test the regex
an online or desktop regex tool (I still use the ancient Regular Expression Workbench from Eric Gunnerson, but I'm sure there are better alternatives these days)
Since you didn't specify a language or regex version, I'll assume .Net regular expressions
Create a single test string for testing a wider variety of options.
Your goal is to include as many edge cases as you can think of. Here's what I would use: "ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10."
Note that I've added a few extra cases you didn't mention:
empty strings between two round bracket numbers: "4)" and "5)"
white space string between two round bracket numbers: "5)" and "6)"
empty strings between a round bracket number and a dotted number: "6)" and "10."
empty string after the dotted number "10." at the end of the string
random text and empty space, which should be ignored, before the first number
I'm going to make a few assumptions here, which you will need to vary based on your actual requirements:
You DO want to capture the white space after the dot or round bracket.
You DO want to capture the white space before the next dotted number or round bracket number.
You might have numbers that go beyond 9, so I've included "10" in the test cases.
You want to capture empty strings at the end e.g. after the "10."
NOTES:
Thinking through this test case forces you to be more rigorous about your requirements.
It will also help you be more efficient while you are manually testing your regular expression.
HOWEVER, this is assuming you aren't following a TDD approach. If you are, then you should probably do things a little differently... create unit tests for each scenario separately and get the regex working incrementally.
This test string doesn't cover all cases. For example, there are no newline or tab characters in the test string. Also it can't test for an empty string following a round bracket number at the very end.
First get a regex working that just captures the round brackets and dotted brackets.
Don't worry about the $6.99 edge case yet.
Drop the "(?:" non-capturing group syntax from your regex for now: "\d)|\d."
This doesn't even parse, because you have an unescaped round bracket.
The revised string is "\d\)|\d.", which parses, but which also matches "99" which you probably weren't expecting. That's because you forgot to escape the "."
The revised string is "\d\)|\d\.". This no longer matches "99", but it now matches "0." at the end instead of "10.". That's because it assumes that numbers will be single digit only.
The following string seems to work: "\d+\)|\d+\."
Time to deal with that pesky "$6.99" now...
Modify the regex so that it doesn't capture a floating point number.
You need to use a negative look ahead pattern to prevent a digit being after the decimal point.
Result: "\d+\)|\d+\.(?!\d)"
Count how many matches this produces. You're going to use this number for checking later results.
Hint: Save the regex pattern somewhere. You want to be able to go back to it any time you mess up your regex pattern beyond repair.
If you found a string splitting function, then you should use it now and avoid the complexity that follows. [I've included an example of this at the end.]
Simple is better, but I'm going to continue with the longer solution in the interests of showing an approach to staying in control of regex'es that start getting horribly complicated
Decide how to exclude that pattern
You used the non-capture group pattern in your question i.e. "(?:"
That approach can work. But it's a bit cumbersome, because you need to have a capturing group after it that you will look for instead.
It would be much nicer if your entire pattern matched what you are looking for.
So wrap the number pattern inside a zero-width positive look behind pattern (if your language supports it) i.e. "(?<=".
This checks for the pattern, but doesn't include it in what gets captured.
So now your regex looks like this: "(?<=\d+\)|\d+\.(?!\d))"
Test it!
It might seem silly to test this on its own - all the matches are empty strings.
Do it anyway. You want to sanity check every step of the way.
Make sure that it still produces the same number of matches as in step 4.
Decide how to match the text in between the numbers.
You rightly mention that ".*" will match the entire string, not just the parts in between.
There's a neat trick that allows you to reuse the pattern from step 5 to get the text in between.
Start by just matching the next character
The trick is that you want to match any character unless it's the start of the next number
That sounds like a negative look ahead pattern again: "(?!"
Let X be the pattern you saved in step 4. Matching a single character will look like this: "(?!X)."
You want to match lots of those characters. So put that pattern into a non-capturing group and repeat it: "(?:(?!X).)*"
This assumes you want to capture empty text.
If you're not, then change the "*" to a "+".
Hint: This is such a common pattern that you will want to reuse it in future pasting in different patterns in place of X
I used a non-capturing group instead of a normal group so that you can also embed this pattern in regexes where you do care about the capturing groups
Resulting pattern: "(?:(?!\d+\)|\d+\.(?!\d)).)*"
I suggest testing this pattern on its own to see what it does
Now put parts 5 and 7 together: "(?<=\d+\)|\d+\.(?!\d))(?:(?!\d+\)|\d+\.(?!\d)).)*"
Test it!
Unit tests!
If this is going into production, then please write lots of unit tests that will explain each step of this thought process
Have pity on the poor soul who has to maintain your regex in future!
By rights that person should be you
I suggest putting a note in your calendar to return to this code in 6 months' time and make sure you can still understand it from the unit tests alone!
Refactor
In six months' time, if you can't understand the code any more, use your newfound insight (and incentive) to solve the problem without using regular expressions (or only very simple ones)
Addendum
As an example of using a string splitting function to get away with a simpler regex, here's a solution in Powershell:
$string = 'ab 1. there is a dsfsdfsd costing $6.99 and 2) there is another one and 3. yet another case 4)5) 6)10.'
$pattern = [regex] '\d+\)|\d+\.(?!\d)'
$string -split $pattern | select-object -skip 1
Judging by the task you have, it might be easier to match the delimiters and use re.split (as also pointed out by bobblebubble in the comments).
I dsuggest a mere
\d+[.)]\B\s*
See it in action (demo)
It matches 1 or more digits, then a . or a ), then it makes sure there is no word letter (digit, letter or underscore) after it and then matches zero or more whitespace.
Python demo:
import re
rx = r'\d+[.)]\B\s*'
test_str = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case\n\"we will give 4. there needs to be another option and 6.99 USD is a bit amount"
print([x for x in re.split(rx,test_str) if x])
Try the following regex with the g modifier:
([A-Za-z\s\-_]+|\d(?!(\)|\.)\D)|\.\d)
Example: https://regex101.com/r/kB1xI0/3
[A-Za-z\s\-_]+ automatically matches all alphabetical characters + whitespace
\d(?!(\)|\.)\D) match any numeric sequence of digits not followed by a closing parenthesis ) or decimal value (.99)
\.\d match any period followed by numeric digit.
I used this pattern:
(?<=\d.\s)(.*?)(?=\d.\s)
demo
This looks for the contents between any digit, any character, then a space.
Edit: Updated pattern to handle the currency issue and line ends better:
This is with flag 'g'
(?<=[0-9].\s)(.*?)(?=\s[0-9].\s|\n|\r)
Demo 2
import re
s = "1) there is a dsfsdfsd and 2) there is another one and 3) yet another case"
s1 = "we will give 4. there needs to be another option and 6.99 USD is a bit amount"
regex = re.compile("\d\)\s.*?|\s\d\.\D.*?")
print ([x for x in regex.split(s) if x])
print regex.split(s1)
Output:
['there is a dsfsdfsd and ', 'there is another one and ', 'yet another case']
['we will give', 'there needs to be another option and 6.99 USD is a bit amount']
I'm trying to split a paragraph into sentences using regex split and I'm trying to use the second answer posted here:
a Regex for extracting sentence from a paragraph in python
But I have a list of abbreviations that I don't want to end the sentence on even though there's a period. But I don't know how to append it to that regular expression properly. I'm reading in the abbreviations from a file that contains terms like Mr. Ms. Dr. St. (one on each line).
Short answer: You can't, unless all lookbehind assertions are of the same, fixed width (which they probably aren't in your case; your example contained only two-letter abbreviations, but Mrs. would break your regex).
This is a limitation of the current Python regex engine.
Longer answer:
You could write a regex like (?s)(?<!.Mr|Mrs|.Ms|.St)\., padding each alternating part of the lookbehind assertion with as many .s as needed to get all of them to the same width. However, that would fail in some circumstances, for example when a paragraph begins with Mr..
Anyway, you're not using the right tool here. Better use a tool designed for the job, for example the Natural Language Toolkit.
If you're stuck with regex (too bad!), then you could try and use a findall() approach instead of split():
(?:(?:\b(?:Mr|Ms|Dr|Mrs|St)\.)|[^.])+\.\s*
would match a sentence that ends in . (optionally followed by whitespace) and may contain no dots unless preceded by one of the allowed abbreviations.
>>> import re
>>> s = "My name is Mr. T. I pity the fool who's not on the A-Team."
>>> re.findall(r"(?:(?:\b(?:Mr|Ms|Dr|Mrs|St)\.)|[^.])+\.\s*", s)
['My name is Mr. T. ', "I pity the fool who's not on the A-Team."]
I don't directly answer your question, but this post should contain enough information for you to write a working regex for your problem.
You can append a list of negative look-behinds. Remember that look-behinds are zero-width, which means that you can put as many look-behinds as you want next to each other, and you are still look-behind from the same position. As long as you don't need to use "many" quantifier (e.g. *, +, {n,}) in the look-behind, everything should be fine (?).
So the regex can be constructured like this:
(?<!list )(?<!of )(?<!words )(?<!not )(?<!allowed )(?<!to )(?<!precede )pattern\w+
It is a bit too verbose. Anyway, I write this post just to demonstrate that it is possible to look-behind on a list of fixed string.
Example run:
>>> s = 'something patterning of patterned crap patternon not patterner, not allowed patternes to patternsses, patternet'
>>> re.findall(r'(?<!list )(?<!of )(?<!words )(?<!not )(?<!allowed )(?<!to )(?<!precede )pattern\w+', s)
['patterning', 'patternon', 'patternet']
There is a catch in using look-behind, though. If there are dynamic number of spaces between the blacklisted text and the text matching the pattern, the regex above will fail. I really doubt there exists a way to modify the regex so that it works for the case above while keeping the look-behinds. (You can always replace consecutive spaces into 1, but it won't work for more general cases).