I am building a ML training dataset from a corpus using some chemical named entities.
The reason I mention the chemical context is just to assure that this is a realistic example of what I am dealing with, not a made up one.
In doing so, I need a regex expression that has the following structure:
1 - Starts by the chemical formula string "2h-tetrazolium, 2,2'-(3,3'-dimethoxy[1,1'-biphenyl]-4,4'-diyl)bis[3-(4-nitrophenyl)-5-phenyl-,chloride (1:2)"
2 - followed by 0 up to 15 characters
3 - followed by the chemical code string "298-83-9"
4 - followed by 0 up to 15 characters
5 - followed by a non-alphanumerical character
6 - followed by the string "5"
7 - ends with a non-alphanumerical value.
The reason that I added the non-alphanumerical requirements #5 and #7 is that the text in which the regex search is to be performed is a long messy text and I wanted to ensure that the string "5" is not part of another entity such as these two examples: "bluh bluh 298-83-9 bluh bluh 564" or "bluh bluh 298-83-9 bluh bluh 645".
The way I approached was building an expression like the following:
reg_exp = name_entity[0] + r".{0,15}\s*" + name_entity[1] + r".{0,15}\s*" + r"[^a-zA-Z\d]+" + name_entity[2] + r"[^a-zA-Z\d]+"
where name_entity is the array that contains the strings in requirements 1, 3, and 6.
However, the chemical formula and code in requirements 1 and 3 contain so many special characters (brackets, parentheses, hyphens, etc.) that my expression does not work. I need a way to tell the regex engine that the name_entity elements are to be treated as exact literal strings, not as regex syntax.
In case it matters, I am coding in Python.
I would appreciate your help. Below is a portion of the multi-page text that contains what the regex is intended to find. The part that my Python code re.findall(reg_exp, text) should find is bolded:
"composition/information on ingredients substance / mixture : mixture substance name : nbt/bcip stock solution, mbf components chemical name cas-no. concentration (% w/w) methane, 1,1'-sulfinylbis- 67-68-5 >= 50 - < 70 2h-tetrazolium, 2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)bis[3-(4-nitrophenyl)-5-phenyl-, chloride (1:2) 298-83-9 >= 1 - < 5 actual concentration is withheld as a trade secret section 4. first aid measures general advice : do not leave the victim unattended. safety data sheet nbt/bcip stock solution version 3.0 revision date: 09-25-2019"
There's a few issues here, but it works with the following code:
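import re
def new_regex(entity):
    # re.escape makes each entity match literally; {{ and }} are literal braces inside an f-string
    return fr"{re.escape(entity[0])}.{{0,15}}\s*{re.escape(entity[1])}.{{0,15}}\s*[^a-zA-Z\d]+{re.escape(entity[2])}[^a-zA-Z\d]+"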
entity = [
"2h-tetrazolium, 2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)bis[3-(4-nitrophenyl)-5-phenyl-, chloride (1:2)",
'298-83-9',
'5'
]
n = "composition/information on ingredients substance / mixture : mixture substance name : nbt/bcip stock solution, mbf components chemical name cas-no. concentration (% w/w) methane, 1,1'-sulfinylbis- 67-68-5 >= 50 - < 70 2h-tetrazolium, 2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)bis[3-(4-nitrophenyl)-5-phenyl-, chloride (1:2) 298-83-9 >= 1 - < 5 actual concentration is withheld as a trade secret section 4. first aid measures general advice : do not leave the victim unattended. safety data sheet nbt/bcip stock solution version 3.0 revision date: 09-25-2019"
regex = re.compile(new_regex(entity))  # new_regex returns a string; compile it so .findall is available
regex.findall(n)
# ["2h-tetrazolium, 2,2'-(3,3'- dimethoxy[1,1'-biphenyl]-4,4'- diyl)bis[3-(4-nitrophenyl)-5-phenyl-, chloride (1:2) 298-83-9 >= 1 - < 5 "]
This was fixed by using re.escape, as well as fixing a few issues with whitespace in your chemical formula. You likely however want to change your entities to handle whitespace better.
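One possible way to make the entities tolerant of stray whitespace is sketched below. It is only a sketch under assumptions: the helper names loose_literal and build_pattern are made up for illustration, and it assumes whitespace may appear (or be missing) between any two characters of an entity. A shortened formula fragment is used as sample data.

```python
import re

def loose_literal(text):
    # Escape every non-whitespace character and allow optional
    # whitespace between consecutive characters.
    return r"\s*".join(re.escape(ch) for ch in text if not ch.isspace())

def build_pattern(name, code, value):
    # Same structure as the expression above: name, gap, code, gap,
    # then the value surrounded by non-alphanumeric characters.
    return (
        loose_literal(name)
        + r".{0,15}"
        + loose_literal(code)
        + r".{0,15}"
        + r"[^a-zA-Z\d]+"
        + re.escape(value)
        + r"[^a-zA-Z\d]+"
    )

# Entity written without the stray spaces that appear in the text.
pattern = re.compile(build_pattern("4,4'-diyl)bis[3-(4-nitrophenyl)", "298-83-9", "5"))
sample = "4,4'- diyl)bis[3-(4-nitrophenyl) 298-83-9 >= 1 - < 5 actual"
print(pattern.findall(sample))
```

With this approach the entity list no longer has to reproduce the exact spacing of the source text.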
Related
I'm working on project which require to extract all the case number from the given string. Can anyone please help me to create a regex to match the pattern for all the case numbers.
Pattern is like: alphanumeric must followed with / alphanumeric must followed with / alphanumeric
*Housekeeping Services For the period( 1‐03‐2020 to 31‐03‐2020) ‐ HDC ‐5i
SL.NO HSN/SAC
Code UOM
Facility
Approved
HC
Total Billing
Hours
Actual Manpower
HC
Unit Rate Per
Month Taxable Value
1 HK Supervisor 9985 HR 4 832 4.00 18,644.00 7 4,576.00*
Case no.**MH20/00285/VAS**
Case no. **MH20/00294/GVN1**
Case no. **MH20/000026/MUMR**
Case no. **KA20/00346/BN**
Case no. **DL20/0024/DLH39**
Case no. **MH20/003B30/GUR2**
Case no. **GJ20/001A75/GJ**
Case no. **GJ20/001A77/GJ**
Case no. **MH20/002CK89/GVN1**
*3,15,962.69
2 8,436.64
2 8,436.64
3,72,836.00
AMOUNT IN WORDS:‐ Rupees Three Lakhs Seventy Two Thousand Eight Hundred Thirty Six Only*
This one should do the Job
[\d\w]{4}/[\d\w]+/[\d\w]+
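A short usage sketch of this pattern with re.findall follows. The sample string is a trimmed copy of the input above; the explicit character class and the word boundaries are my own assumptions about what counts as a case number, not part of the answer.

```python
import re

# \w already covers digits, so [\d\w] can be shortened; an explicit
# alphanumeric class also excludes the underscore that \w would allow.
CASE_NO = re.compile(r"\b[A-Za-z0-9]{4}/[A-Za-z0-9]+/[A-Za-z0-9]+\b")

sample = """SL.NO HSN/SAC
Case no.**MH20/00285/VAS**
Case no. **DL20/0024/DLH39**
Case no. **MH20/002CK89/GVN1**"""

case_numbers = CASE_NO.findall(sample)
print(case_numbers)
```

Requiring two slashes keeps headers such as HSN/SAC from matching.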
I'm trying to extract ONLY one string that contains $ character. The input based on output that I extracted using BeautifulSoup.
Code
price = [m.split() for m in re.findall(r"\w+/$(?:\s+\w+/$)*", soup_content.find('blockquote', { "class": "postcontent restore" }).text)]
Input
For Sale is my Tag Heuer Carrera Calibre 6 with box and papers and extras.
39mm
47 ish lug to lug
19mm in between lugs
Pretty thin but not sure exact height. Likely around 12mm (maybe less)
I've owned it for about 2 years. I absolutely love the case on this watch. It fits my wrist and sits better than any other watch I've ever owned. I'm selling because I need cash and other pieces have more sentimental value
I am the second owner, but the first barely wore it.
It comes with barely worn blue leather strap, extra suede strap that matches just about perfectly and I'll include a blue Barton Band Elite Silicone.
I also purchased an OEM bracelet that I personally think takes the watch to a new level. This model never came with a bracelet and it was several hundred $ to purchase after the fact.
The watch was worn in rotation and never dropped or knocked around.
The watch does have hairlines, but they nearly all superficial. A bit of time with a cape cod cloth would take care of a lot it them. The pics show the imperfections in at "worst" possible angle to show the nature of scratches.
The bracelet has a few desk diving marks, but all in all, the watch and bracelet are in very good shape.
Asking $2000 obo. PayPal shipped. CONUS.
It's a big hard to compare with others for sale as this one includes the bracelet.
The output should be like this.
2000
You don't need a regex. Instead you can iterate over lines and over each word to check for starting with '$' and extract the word:
[word[1:] for line in s.split('\n') for word in line.split() if word.startswith('$') and len(word) > 1]
where s is your paragraph.
which outputs:
['2000']
Since this is very simple you don't need a regex solution; this should suffice:
words = text.split()
words_with_dollar = [word for word in words if '$' in word]
print(words_with_dollar)
>>> ['$', '$2000']
If you don't want the dollar sign alone, you can add a filter like this:
words_with_dollar = [word for word in words if '$' in word and '$' != word]
print(words_with_dollar)
>>> ['$2000']
I would do something like that (provided input is the string you wrote above)-
price_start = input.find('$')
price = input[price_start:].split(' ')[0]
This works if there is only one occurrence of '$', as you said.
Alternatively, you could use a regex like this:
price = re.findall(r'\S*\$\S*\d', input)[0]
price = price.replace('$', '')
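If a regex is preferred, a capturing group can return the amount without the dollar sign in one step. This is a minimal sketch, assuming prices are written as a '$' immediately followed by digits; the name PRICE and the shortened sample text are mine.

```python
import re

# A dollar sign immediately followed by digits, with optional thousands
# separators and cents; a lone "$" is not matched.
PRICE = re.compile(r"\$(\d+(?:,\d{3})*(?:\.\d+)?)")

text = (
    "it was several hundred $ to purchase after the fact.\n"
    "Asking $2000 obo. PayPal shipped. CONUS."
)
prices = PRICE.findall(text)
print(prices)
```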
I have the following sentence where I want to get rid of everything with the format '(number)/(... ; number)' :
In all living organisms, from bacteria to man, DNA and chromatin are
invariably associated with binding proteins, which organize their
structure (1; 2 ; 3). Many of these architectural proteins are
molecular bridges that can bind at two or more distinct DNA sites to
form loops. For example, bacterial DNA is looped and compacted by the
histonelike protein H-NS, which has two distinct DNA-binding domains
(4). In eukaryotes, complexes of transcription factors and RNA
polymerases stabilize enhancer-promoter loops (5; 6; 7 ; 8), while
HP1 (9), histone H1 (10), and the polycomb-repressor complex PRC1/2
(11 ; 12) organize inactive chromatin. Proteins also bind to specific
DNA sequences to form larger structures, like nucleoli and the
histone-locus, or Cajal and promyeloleukemia bodies (13; 14; 15; 16;
17 ; 18). The selective binding of molecular bridges to active and
inactive regions of chromatin has also been highlighted as one
possible mechanism underlying the formation of topologically
associated domains (TADs)—regions rich in local DNA interactions (6; 8
; 19).
I want it to be in the form:
In all living organisms, from bacteria to man, DNA and chromatin are
invariably associated with binding proteins, which organize their
structure . Many of these architectural proteins are molecular bridges
that can bind at two or more distinct DNA sites to form loops. For
example, bacterial DNA is looped and compacted by the histonelike
protein H-NS, which has two distinct DNA-binding domains . In
eukaryotes, complexes of transcription factors and RNA polymerases
stabilize enhancer-promoter loops , while HP1 , histone H1 , and the
polycomb-repressor complex PRC1/2 organize inactive chromatin.
Proteins also bind to specific DNA sequences to form larger
structures, like nucleoli and the histone-locus, or Cajal and
promyeloleukemia bodies . The selective binding of molecular bridges
to active and inactive regions of chromatin has also been highlighted
as one possible mechanism underlying the formation of topologically
associated domains (TADs)—regions rich in local DNA interactions .
My attempt was as follows:
import re
x=re.sub(r'\(.+; \d+\)', '', x) # eliminate brackets with multiple numbers
#### NOTE: there are 2 spaces between the last ';' and the last digit
x=re.sub(r'\d+\)', '', x) # eliminate brackets with single number
My output was this:
In all living organisms, from bacteria to man, DNA and chromatin are
invariably associated with binding proteins, which organize their
structure .
So clearly my code is missing something. I thought that '(.+)' would identify all brackets containing non-arbitrary characters and then I could further specify that I want all the ones ending in a '; number'.
I just want a flexible way of indexing a sentence at all places with '(number' and 'number)' and eliminate everything in between....
Maybe you can try to use the pattern
re.sub('\([0-9; ]+\)', '', x)
which removes every parenthesized group that consists only of digits, ";" characters, and spaces.
The pattern works without the r prefix, although a raw string avoids invalid-escape warnings in recent Python versions.
Try the following regex:
r'\s\((\d+\s?;?\s?)+\)'
This regex will match one or more groups of numbers (followed by spaces/semicolons) inside of parenthesis.
There seems to always be a space before the collection of numbers, so matching that should help with the "trailing space".
You can use a pattern like \(\d+(?:;\s?\d+\s?)*\), which matches an initial parentheses and digits ( <number>, and then any possible repeating ; <number>s that ends in ). Test it.
Or if you're feeling brave you can use \([;\d\s]+\) which just matches everything with digits/spaces/semicolons between two parentheses. Test it.
I found a glitch in your expected text, there's 1 space missing after PRC1/2. But this code works, with that space added back in:
text="""
In all living organisms, from bacteria to man, DNA and chromatin are invariably
associated with binding proteins, which organize their structure (1; 2 ; 3).
Many of these architectural proteins are molecular bridges that can bind at two
or more distinct DNA sites to form loops. For example, bacterial DNA is looped
and compacted by the histonelike protein H-NS, which has two distinct
DNA-binding domains (4). In eukaryotes, complexes of transcription factors and
RNA polymerases stabilize enhancer-promoter loops (5; 6; 7 ; 8), while HP1 (9),
histone H1 (10), and the polycomb-repressor complex PRC1/2 (11 ; 12) organize
inactive chromatin. Proteins also bind to specific DNA sequences to form larger
structures, like nucleoli and the histone-locus, or Cajal and promyeloleukemia
bodies (13; 14; 15; 16; 17 ; 18). The selective binding of molecular bridges to
active and inactive regions of chromatin has also been highlighted as one
possible mechanism underlying the formation of topologically associated domains
(TADs)—regions rich in local DNA interactions (6; 8 ; 19).
""".replace('\n', ' ')
expected="""
In all living organisms, from bacteria to man, DNA and chromatin are invariably
associated with binding proteins, which organize their structure . Many of
these architectural proteins are molecular bridges that can bind at two or more
distinct DNA sites to form loops. For example, bacterial DNA is looped and
compacted by the histonelike protein H-NS, which has two distinct DNA-binding
domains . In eukaryotes, complexes of transcription factors and RNA polymerases
stabilize enhancer-promoter loops , while HP1 , histone H1 , and the
polycomb-repressor complex PRC1/2 organize inactive chromatin. Proteins also
bind to specific DNA sequences to form larger structures, like nucleoli and the
histone-locus, or Cajal and promyeloleukemia bodies . The selective binding of
molecular bridges to active and inactive regions of chromatin has also been
highlighted as one possible mechanism underlying the formation of topologically
associated domains (TADs)—regions rich in local DNA interactions .
""".replace('\n', ' ')
import re
cites = r"\(\s*\d+(?:\s*;\s+\d+)*\s*\)"
edited = re.sub(cites, '', text)
i = 0
while i < len(edited):
    if edited[i] == expected[i]:
        print(edited[i], sep='', end='')
    else:
        print('[', edited[i], ']', sep='', end='')
    i+=1
print('')
The regex I'm using is cites, and it looks like this:
r"\(\s*\d+(?:\s*;\s+\d+)*\s*\)"
The syntax r"..." means "raw", which for our purposes means "leave backslashes alone!" It's what you should (nearly) always use for regexes.
The outer \( and \) match the actual parens in the citation.
The \s* matches zero or more "white-space" characters, which include spaces, tabs, and newlines.
The \d+ matches one or more digits [0..9].
So to start with, there's a regex like:
\( \s* \d+ \s* \)
Which is just "parens around a number, maybe with spaces before or after."
The inner part,
(?:\s*;\s+\d+)*
says "don't capture": (?:...) is a non-capturing group, because we don't care about \1 or getting anything out of the pattern, we just want to delete it.
The \s*; matches optional spaces before a semicolon.
The \s+\d+ matches required spaces before another number - you might have to make those optional spaces if you have something like "(1;3;5)".
The * after the non-capturing group means zero or more occurrences.
Put it all together and you have:
open-paren
optional spaces
number
followed by zero or more of:
optional spaces
semicolon
required spaces
number
optional spaces
close-paren
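The breakdown above maps naturally onto a verbose regex. The sketch below is a variant rather than the exact pattern from this answer: it makes the spaces after the semicolon optional (so "(1;3;5)" also matches) and it swallows the whitespace before the citation, which is my own assumption about the desired output. The names CITES and strip_citations are illustrative.

```python
import re

CITES = re.compile(r"""
    \s*                   # whitespace before the citation (removed too)
    \(                    # open-paren
    \s* \d+               # optional spaces, number
    (?: \s* ; \s* \d+ )*  # zero or more: semicolon, optional spaces, number
    \s* \)                # optional spaces, close-paren
""", re.VERBOSE)

def strip_citations(text):
    # Delete every numeric citation group; other parentheses are kept.
    return CITES.sub("", text)

sample = "structure (1; 2 ; 3). Domains (TADs) and loops (5;6), while PRC1/2 (11 ; 12) organize"
print(strip_citations(sample))
```

Non-numeric parentheses such as "(TADs)" are left alone because the group must start with a digit.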
Hi, can anyone help me improve my regular expression, which is not working?
Strings Cases:
1) 120 lbs and is intended for riders ages 8 years and up. #catch : 8 years and up
2) 56w x 28d x 32h inches recommended for hobbyists recommended for ages 12 and up. #catch : 12 and up
3) 4 users recorded speech for effective use language tutor pod measures 11l x 9w x 5h inches recommended for ages 6 and above. #catch : 6 and above
I want a generic regular expression which works for all three strings.
My regular expression is :
\b\d+[\w+\s]?(?:\ban[a-z]\sup\b|\ban[a-z]\sabove\b|\ban[a-z]\sold[a-z]*\b|\b&\sup)
But it is not working quite well. If anyone can provide me a generic regular expression which works for all 3 cases. I am using python re.findall()
Anyone? could Help?
Make it a habit and start with verbose regular expressions:
import re
rx = re.compile(r'''
ages\ # look for ages
(\d+(?:\ years)?\ and\ (?:above|up)) # capture a digit, years eventually
# and one of above or up
''', re.VERBOSE)
string = '''
1) 120 lbs and is intended for riders ages 8 years and up. #catch : 8 years and up
2) 56w x 28d x 32h inches recommended for hobbyists recommended for ages 12 and up. #catch : 12 and up
3) 4 users recorded speech for effective use language tutor pod measures 11l x 9w x 5h inches recommended for ages 6 and above. #catch : 6 and above
'''
matches = rx.findall(string)
print(matches)
# ['8 years and up', '12 and up', '6 and above']
See a demo on ideone.com as well as on regex101.com.
(As the suggestion I made in a comment appears to have been what you wanted, I offer it as an answer.)
If your examples illustrate all possible strings (but I fear they don't ;) you could do it as simple as
\d+[^\d]*$
See it here at regex101.
It matches the last number, and everything after it.
Or a little bit more sophisticated - making sure it's preceded by age - here
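Both ideas can be tried with a short script. This is only a sketch: it assumes the "#catch : ..." notes in the question are annotations rather than part of the strings, and the second pattern (AFTER_AGES) is my own guess at an "ages"-anchored version, not necessarily the one behind the link.

```python
import re

LAST_NUMBER = re.compile(r"\d+[^\d]*$")               # last number and everything after it
AFTER_AGES = re.compile(r"\bages\s+(\d+\D*?)\.?$")    # same, but anchored on "ages"

sentences = [
    "120 lbs and is intended for riders ages 8 years and up.",
    "56w x 28d x 32h inches recommended for hobbyists recommended for ages 12 and up.",
    "language tutor pod measures 11l x 9w x 5h inches recommended for ages 6 and above.",
]

simple = [LAST_NUMBER.search(s).group() for s in sentences]
anchored = [AFTER_AGES.search(s).group(1) for s in sentences]
print(simple)
print(anchored)
```

The anchored version also drops the trailing period, since the lazy `\D*?` stops before an optional final dot.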
I'm trying to extract data from a few large textfiles containing entries about people. The problem is, though, I cannot control the way the data comes to me.
It is usually in a format like this:
LASTNAME, Firstname Middlename (Maybe a Nickname)Why is this text hereJanuary, 25, 2012
Firstname Lastname 2001 Some text that I don't care about
Lastname, Firstname blah blah ... January 25, 2012 ...
Currently, I am using a huge regex that splits all kindaCamelcase words, all words that have a month name tacked onto the end, and a lot of special cases for names. Then I use more regex to extract a lot of combinations for the name and date.
This seems sub-optimal.
Are there any machine-learning libraries for Python that can parse malformed data that is somewhat structured?
I've tried NLTK, but it could not handle my dirty data. I'm tinkering with Orange right now and I like its OOP style, but I'm not sure if I'm wasting my time.
Ideally, I'd like to do something like this to train a parser (with many input/output pairs):
training_data = (
'LASTNAME, Firstname Middlename (Maybe a Nickname)FooBarJanuary 25, 2012',
['LASTNAME', 'Firstname', 'Middlename', 'Maybe a Nickname', 'January 25, 2012']
)
Is something like this possible or am I overestimating machine learning? Any suggestions will be appreciated, as I'd like to learn more about this topic.
I ended up implementing a somewhat-complicated series of exhaustive regexes that encompassed every possible use case using text-based "filters" that were substituted with the appropriate regexes when the parser loaded.
If anyone's interested in the code, I'll edit it into this answer.
Here's basically what I used. To construct the regular expressions out of my "language", I had to make replacement classes:
class Replacer(object):
    def __call__(self, match):
        group = match.group(0)
        if group[1:].lower().endswith('_nm'):
            return '(?:' + Matcher(group).regex[1:]
        else:
            return '(?P<' + group[1:] + '>' + Matcher(group).regex[1:]
Then, I made a generic Matcher class, which constructed a regex for a particular pattern given the pattern name:
class Matcher(object):
    name_component = r"([A-Z][A-Za-z|'|\-]+|[A-Z][a-z]{2,})"
    name_component_upper = r"([A-Z][A-Z|'|\-]+|[A-Z]{2,})"
    year = r'(1[89][0-9]{2}|20[0-9]{2})'
    year_upper = year
    age = r'([1-9][0-9]|1[01][0-9])'
    age_upper = age
    ordinal = r'([1-9][0-9]|1[01][0-9])\s*(?:th|rd|nd|st|TH|RD|ND|ST)'
    ordinal_upper = ordinal
    date = r'((?:{0})\.? [0-9]{{1,2}}(?:th|rd|nd|st|TH|RD|ND|ST)?,? \d{{2,4}}|[0-9]{{1,2}} (?:{0}),? \d{{2,4}}|[0-9]{{1,2}}[\-/\.][0-9]{{1,2}}[\-/\.][0-9]{{2,4}})'.format('|'.join(months + months_short) + '|' + '|'.join(months + months_short).upper())
    date_upper = date
    matchers = [
        'name_component',
        'year',
        'age',
        'ordinal',
        'date',
    ]
    def __init__(self, match=''):
        capitalized = '_upper' if match.isupper() else ''
        match = match.lower()[1:]
        if match.endswith('_instant'):
            match = match[:-8]
        if match in self.matchers:
            self.regex = getattr(self, match + capitalized)
        elif len(match) == 1:
            pass  # the body of this branch is missing from the original post
        elif 'year' in match:
            self.regex = getattr(self, 'year')
        else:
            self.regex = getattr(self, 'name_component' + capitalized)
Finally, there's the generic Pattern object:
import re
class Pattern(object):
    def __init__(self, text='', escape=None):
        self.text = text
        self.matchers = []
        escape = not self.text.startswith('!') if escape is None else False
        if escape:
            # Escape regex metacharacters in the literal parts of the template
            self.regex = re.sub(r'([\[\].?+\-()\^\\])', r'\\\1', self.text)
        else:
            self.regex = self.text[1:]
        self.size = len(re.findall(r'(\$[A-Za-z0-9\-_]+)', self.regex))
        # Replace each $placeholder with the regex built by Replacer
        self.regex = re.sub(r'(\$[A-Za-z0-9\-_]+)', Replacer(), self.regex)
        self.regex = re.sub(r'\s+', r'\\s+', self.regex)
    def search(self, text):
        return re.search(self.regex, text)
    def findall(self, text, max_depth=1.0):
        results = []
        length = float(len(text))
        for result in re.finditer(self.regex, text):
            if result.start() / length < max_depth:
                results.extend(result.groups())
        return results
    def match(self, text):
        # A list (rather than a lazy map object) is empty when nothing matches
        return [(x.groupdict(), x.start()) for x in re.finditer(self.regex, text)]
It got pretty complicated, but it worked. I'm not going to post all of the source code, but this should get someone started. In the end, it converted a file like this:
$LASTNAME, $FirstName $I. said on $date
Into a compiled regex with named capturing groups.
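For readers who only want the core idea, here is a much smaller, self-contained sketch of turning such a template into a regex with named groups. It is not the code from this answer: compile_template, SUBPATTERNS and DEFAULT are hypothetical names, and the sub-patterns are deliberately simplistic.

```python
import re

PLACEHOLDER = re.compile(r"\$([A-Za-z_]\w*)")

# Hypothetical sub-patterns, looked up by lower-cased placeholder name.
SUBPATTERNS = {"date": r"[A-Z][a-z]+\.? \d{1,2},? \d{4}"}
DEFAULT = r"[A-Z][A-Za-z'\-]+"  # fallback: a capitalized name component

def _literal(text):
    # Escape literal text and let any run of whitespace match \s+.
    return r"\s+".join(re.escape(tok) for tok in re.split(r"\s+", text))

def compile_template(template):
    pieces, pos = [], 0
    for m in PLACEHOLDER.finditer(template):
        pieces.append(_literal(template[pos:m.start()]))
        name = m.group(1)
        sub = SUBPATTERNS.get(name.lower(), DEFAULT)
        pieces.append(f"(?P<{name}>{sub})")  # named capturing group
        pos = m.end()
    pieces.append(_literal(template[pos:]))
    return re.compile("".join(pieces))

rx = compile_template("$LASTNAME, $FirstName said on $date")
found = rx.search("SMITH, John said on January 25, 2012")
print(found.groupdict())
```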
I have a similar problem, mainly because exporting data from Microsoft Office 2010 joins two consecutive words at somewhat regular intervals. The domain is morphological processing, like a spelling checker. You can jump to a machine learning solution or create a heuristic solution like I did.
The easy solution is to assume that the newly formed word is a combination of proper names (each with its first character capitalized).
A second, additional solution is to keep a dictionary of valid words and try a set of partition locations that yield two (or at least one) valid words. A problem may arise when one of them is a proper name, which by definition is out of vocabulary for that dictionary. Perhaps word-length statistics can be used to identify whether a word is mistakenly joined or actually legitimate.
In my case, this is part of manual correction of large corpora of text (a human-in-the-loop verification) but the only thing which can be automated is selection of probably-malformed words and its corrected recommendation.
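The dictionary-based heuristic could look roughly like the sketch below. The function name split_joined and the tiny vocabulary are placeholders for illustration; a real word list would be needed in practice, and proper names would still fall outside it.

```python
def split_joined(word, vocabulary):
    """Return every (left, right) split where both halves are known words."""
    lower = word.lower()
    return [
        (word[:i], word[i:])
        for i in range(1, len(word))
        if lower[:i] in vocabulary and lower[i:] in vocabulary
    ]

vocabulary = {"he", "here", "january", "text"}
candidates = split_joined("hereJanuary", vocabulary)
print(candidates)
```

An empty result marks a token as either legitimate or in need of manual review.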
Regarding the concatenated words, you can split them using a tokenizer:
The OpenNLP Tokenizers segment an input character sequence into tokens. Tokens are usually words, punctuation, numbers, etc.
For example:
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
is tokenized into:
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
OpenNLP has a "learnable tokenizer" that you can train. If that doesn't work, you can try the answers to: Detect most likely words from text without spaces / combined words.
When splitting is done, you can eliminate the punctuation and pass it to a NER system such as CoreNLP:
Johnson John Doe Maybe a Nickname Why is this text here January 25 2012
which outputs:
Tokens
Id Word Lemma Char begin Char end POS NER Normalized NER
1 Johnson Johnson 0 7 NNP PERSON
2 John John 8 12 NNP PERSON
3 Doe Doe 13 16 NNP PERSON
4 Maybe maybe 17 22 RB O
5 a a 23 24 DT O
6 Nickname nickname 25 33 NN MISC
7 Why why 34 37 WRB MISC
8 is be 38 40 VBZ O
9 this this 41 45 DT O
10 text text 46 50 NN O
11 here here 51 55 RB O
12 January January 56 63 NNP DATE 2012-01-25
13 25 25 64 66 CD DATE 2012-01-25
14 2012 2012 67 71 CD DATE 2012-01-25
One part of your problem: "all words that have a month name tacked onto the end,"
If as appears to be the case you have a date in the format Monthname 1-or-2-digit-day-number, yyyy at the end of the string, you should use a regex to munch that off first. Then you have a now much simpler job on the remainder of the input string.
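A minimal sketch of that first step is shown below, assuming the date really is the last thing on the line and uses full English month names; split_trailing_date and TRAILING_DATE are illustrative names, and the optional comma after the month is there only because the sample data contains "January, 25, 2012".

```python
import re

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")

# Month name, 1-2 digit day, 4-digit year, anchored at the end of the line.
TRAILING_DATE = re.compile(
    r"(?P<date>(?:%s),?\s+\d{1,2},?\s+\d{4})\s*$" % MONTHS
)

def split_trailing_date(line):
    """Return (remainder, date); date is None when no trailing date is found."""
    m = TRAILING_DATE.search(line)
    if m is None:
        return line, None
    return line[:m.start()].rstrip(), m.group("date")

print(split_trailing_date("LASTNAME, Firstname (Nick)Why is this text hereJanuary, 25, 2012"))
print(split_trailing_date("SMITH, John March"))
```

Because the day and year are required, a name such as "John March" is not mistaken for a date.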
Note: Otherwise you could run into problems with given names which are also month names e.g. April, May, June, August. Also March is a surname which could be used as a "middle name" e.g. SMITH, John March.
Your use of the "last/first/middle" terminology is "interesting". There are potential problems if your data includes non-Anglo names like these:
Mao Zedong aka Mao Ze Dong aka Mao Tse Tung
Sima Qian aka Ssu-ma Ch'ien
Saddam Hussein Abd al-Majid al-Tikriti
Noda Yoshihiko
Kossuth Lajos
José Luis Rodríguez Zapatero
Pedro Manuel Mamede Passos Coelho
Sukarno
A few pointers, to get you started:
for date parsing, you could start with a couple of regexes, and then you could use chronic or jChronic
for names, these OpenNlp models should work
As for training a machine learning model yourself, this is not so straightforward, especially regarding training data (work effort)...