I need to extract headings and the chunk of text beneath each of them from a text file in Python using a regular expression, but I'm finding it difficult.
I converted this PDF to text so that it now looks like this:
So far I have been able to get all the numerical headers (12.4.5.4, 12.4.5.6, 13, 13.1, 13.1.1, 13.1.12) using the following regex:
import re
with open('data/single.txt', encoding='UTF-8') as file:
    for line in file:
        headings = re.findall(r'^\d+(?:\.\d+)*\.?', line)
        print(headings)
I just don't know how to get the worded part of those headings or the paragraph of text beneath them.
EDIT - Here is the text:
I.S. EN 60601-1:2006&A1:2013&AC:2014&A12:2014
60601-1 © IEC:2005
60601-1 © IEC:2005
– 337 –
– 169 –
12.4.5.4 Other ME EQUIPMENT producing diagnostic or therapeutic radiation
When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the
RISKS associated with ME EQUIPMENT producing diagnostic or therapeutic radiation other than
for diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3).
Compliance is checked by inspection of the RISK MANAGEMENT FILE.
12.4.6 Diagnostic or therapeutic acoustic pressure
When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the
RISKS associated with diagnostic or therapeutic acoustic pressure.
Compliance is checked by inspection of the RISK MANAGEMENT FILE.
13 * HAZARDOUS SITUATIONS and fault conditions
13.1 Specific HAZARDOUS SITUATIONS
General
13.1.1
When applying the SINGLE FAULT CONDITIONS as described in 4.7 and listed in 13.2, one at a
time, none of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive) shall occur in the
ME EQUIPMENT.
The failure of any one component at a time, which could result in a HAZARDOUS SITUATION, is
described in 4.7.
Emissions, deformation of ENCLOSURE or exceeding maximum temperature
13.1.2
The following HAZARDOUS SITUATIONS shall not occur:
– emission of flames, molten metal, poisonous or ignitable substance in hazardous
quantities;
– deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired;
–
temperatures of APPLIED PARTS exceeding the allowed values identified in Table 24 when
measured as described in 11.1.3;
temperatures of ME EQUIPMENT parts that are not APPLIED PARTS but are likely to be
touched, exceeding the allowable values in Table 23 when measured and adjusted as
described in 11.1.3;
–
– exceeding the allowable values for “other components and materials” identified in Table 22
times 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31.
In all other cases, the allowable values of Table 22 apply.
Temperatures shall be measured using the method described in 11.1.3.
The SINGLE FAULT CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to the emission of
flames, molten metal or ignitable substances, shall not be applied to parts and components
where:
– The construction or the supply circuit limits the power dissipation in SINGLE FAULT
CONDITION to less than 15 W or the energy dissipation to less than 900 J.
You could use your pattern and match a space after it followed by the rest of the line.
Then repeat matching all following lines that do not start with a heading.
^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*
^\d+(?:\.\d+)* Your pattern to match a heading, followed by a space
.* Match any char except a newline 0+ times
(?: Non capturing group
\r?\n Match a newline
(?! Negative lookahead, assert what is directly to the right is not
\d+(?:\.\d+)* The heading pattern followed by a space
) Close lookahead
.* Match any char except a newline 0+ times
)* Close the non capturing group and repeat 0+ times to match all the lines
Use the re.M (multiline) flag so that ^ matches at the start of every line.
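The pattern above can be tried out in Python roughly as follows. This is a minimal sketch: the helper name split_sections and the shortened sample text are made up for illustration, and each matched block is split into number, title and body with plain string methods.

```python
import re

# Heading line followed by every line that does not itself start with a heading.
HEADING_BLOCK = re.compile(
    r'^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*',
    re.M,  # ^ must match at the start of every line
)

def split_sections(text):
    """Return (number, title, body) tuples, one per heading block."""
    sections = []
    for block in HEADING_BLOCK.findall(text):
        first_line, _, body = block.partition('\n')
        number, _, title = first_line.partition(' ')
        sections.append((number, title.strip(), body.strip()))
    return sections

sample = """header line
12.4.6 Diagnostic or therapeutic acoustic pressure
When applicable, the MANUFACTURER shall address the RISKS.
Compliance is checked by inspection.
13 * HAZARDOUS SITUATIONS and fault conditions
13.1 Specific HAZARDOUS SITUATIONS
General
"""

for number, title, body in split_sections(sample):
    print(number, '|', title, '|', repr(body))
```

Headings whose number sits alone on a line (such as 13.1.1 in the question) are not matched, because the pattern requires a space after the number.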
Maybe,
^(\d+(?:\.\d+)*)\s+([\s\S]*?)(?=^\d+(?:\.\d+)*)|^(\d+(?:\.\d+)*)\s+([\s\S]*)
comes close to capturing the text you're after, as far as I can guess.
Here we simply look for lines that start with
^(\d+(?:\.\d+)*)\s+
then collect everything after that using
([\s\S]*?)
up to the next line that starts with
(?=^\d+(?:\.\d+)*)
Depending on what the input looks like, one last section may be left over with no heading after it; we collect it with
^(\d+(?:\.\d+)*)\s+([\s\S]*)
which is joined to the first expression as an alternative (using |).
Even though this method is simple to code, it is fairly slow, because the lazy [\s\S]*? has to test the lookahead at every character; if running time is a concern, which is likely, the other answer here is the better choice.
Test
import re
regex = r"^(\d+(?:\.\d+)*)\s+([\s\S]*?)(?=^\d+(?:\.\d+)*)|^(\d+(?:\.\d+)*)\s+([\s\S]*)"
string = """
I.S. EN 60601-1:2006&A1:2013&AC:2014&A12:2014
60601-1 © IEC:2005
60601-1 © IEC:2005
– 337 –
– 169 –
12.4.5.4 Other ME EQUIPMENT producing diagnostic or therapeutic radiation
When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the
RISKS associated with ME EQUIPMENT producing diagnostic or therapeutic radiation other than
for diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3).
Compliance is checked by inspection of the RISK MANAGEMENT FILE.
12.4.6 Diagnostic or therapeutic acoustic pressure
When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the
RISKS associated with diagnostic or therapeutic acoustic pressure.
Compliance is checked by inspection of the RISK MANAGEMENT FILE.
13 * HAZARDOUS SITUATIONS and fault conditions
13.1 Specific HAZARDOUS SITUATIONS
* General
13.1.1
When applying the SINGLE FAULT CONDITIONS as described in 4.7 and listed in 13.2, one at a
time, none of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive) shall occur in the
ME EQUIPMENT.
The failure of any one component at a time, which could result in a HAZARDOUS SITUATION, is
described in 4.7.
* Emissions, deformation of ENCLOSURE or exceeding maximum temperature
13.1.2
The following HAZARDOUS SITUATIONS shall not occur:
– emission of flames, molten metal, poisonous or ignitable substance in hazardous
quantities;
– deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired;
–
temperatures of APPLIED PARTS exceeding the allowed values identified in Table 24 when
measured as described in 11.1.3;
temperatures of ME EQUIPMENT parts that are not APPLIED PARTS but are likely to be
touched, exceeding the allowable values in Table 23 when measured and adjusted as
described in 11.1.3;
–
– exceeding the allowable values for “other components and materials” identified in Table 22
times 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31.
In all other cases, the allowable values of Table 22 apply.
Temperatures shall be measured using the method described in 11.1.3.
The SINGLE FAULT CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to the emission of
flames, molten metal or ignitable substances, shall not be applied to parts and components
where:
– The construction or the supply circuit limits the power dissipation in SINGLE FAULT
CONDITION to less than 15 W or the energy dissipation to less than 900 J.
"""
print(re.findall(regex, string, re.M))
Output
[('12.4.5.4', 'Other ME EQUIPMENT producing diagnostic or therapeutic
radiation \nWhen applicable, the MANUFACTURER shall address in
the RISK MANAGEMENT PROCESS the \nRISKS associated with ME
EQUIPMENT producing diagnostic or therapeutic radiation other than
\nfor diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3).
\n\nCompliance is checked by inspection of the RISK MANAGEMENT
FILE.\n\n', '', ''), ('12.4.6', 'Diagnostic or therapeutic acoustic
pressure \nWhen applicable, the MANUFACTURER shall address in
the RISK MANAGEMENT PROCESS the \nRISKS associated with diagnostic
or therapeutic acoustic pressure. \n\nCompliance is checked by
inspection of the RISK MANAGEMENT FILE.\n\n', '', ''), ('13', '*
HAZARDOUS SITUATIONS and fault conditions\n\n', '', ''), ('13.1',
'Specific HAZARDOUS SITUATIONS\n\n* General \n\n', '', ''),
('13.1.1', 'When applying the SINGLE FAULT CONDITIONS as
described in 4.7 and listed in 13.2, one at a \ntime, none
of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive)
shall occur in the \nME EQUIPMENT.\n\nThe failure of any one
component at a time, which could result in a HAZARDOUS SITUATION, is
\ndescribed in 4.7. \n\n* Emissions, deformation of ENCLOSURE or
exceeding maximum temperature \n\n', '', ''), ('', '', '13.1.2', 'The
following HAZARDOUS SITUATIONS shall not occur: \n– emission of
flames, molten metal, poisonous or ignitable substance in
hazardous \n\nquantities; \n\n– deformation of ENCLOSURES to such an
extent that compliance with 15.3.1 is impaired; \n– \n\ntemperatures
of APPLIED PARTS exceeding the allowed values identified in
Table 24 when \nmeasured as described in 11.1.3; \ntemperatures of
ME EQUIPMENT parts that are not APPLIED PARTS but are likely
to be \ntouched, exceeding the allowable values in Table 23
when measured and adjusted as \ndescribed in 11.1.3; \n\n– \n\n–
exceeding the allowable values for “other components and materials”
identified in Table 22 \ntimes 1,5 minus 12,5 °C. Limits for windings
are found in Table 26, Table 27 and Table 31. \nIn all other cases,
the allowable values of Table 22 apply. \n\nTemperatures shall be
measured using the method described in 11.1.3. \n\nThe SINGLE FAULT
CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to
the emission of \nflames, molten metal or ignitable substances,
shall not be applied to parts and components \nwhere: \n– The
construction or the supply circuit limits the power
dissipation in SINGLE FAULT \n\nCONDITION to less than 15 W or the
energy dissipation to less than 900 J. \n\n')]
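The empty strings in that output come from the alternation: each match fills either groups 1 and 2 or groups 3 and 4. A minimal sketch of how the result could be collapsed into (number, text) pairs, assuming a hypothetical helper named sections and a tiny made-up sample:

```python
import re

PATTERN = re.compile(
    r"^(\d+(?:\.\d+)*)\s+([\s\S]*?)(?=^\d+(?:\.\d+)*)|^(\d+(?:\.\d+)*)\s+([\s\S]*)",
    re.M,
)

def sections(text):
    """Collapse the four groups into (number, text) pairs."""
    result = []
    for m in PATTERN.finditer(text):
        if m.group(1) is not None:
            # First alternative: a section followed by another heading.
            number, body = m.group(1), m.group(2)
        else:
            # Second alternative: the last section in the input.
            number, body = m.group(3), m.group(4)
        result.append((number, body.strip()))
    return result

sample = "intro\n1 First\nbody one\n1.1 Second\nbody two\n"
print(sections(sample))
```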
Thanks to their detailed answers and helpful explanations, I ended up combining parts of both @The-fourth-bird's code and @Emma's code into this regex, which seems to work nicely for what I need.
(^\d+(?:\.\d+)*\s+)((?![a-z])[\s\S].*(?:\r?\n))([\s\S]*?)(?=^\d+(?:\.\d+)*\s+(?![a-z]))
Here is the REGEX DEMO.
It does what I want, which is splitting the (numerical heading), (worded heading) and (body of text) into groups separated by commas. That lets me separate them into columns in Excel by using the custom delimiter ), ( and some other post-processing.
The nice thing about this new regex is that it skips numbered lines that are only references rather than actual headings.
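A minimal sketch of that behaviour on a made-up sample. One caveat: as written, the regex needs another heading after each section, so the last section in the input is never matched. The |\Z added to the final lookahead below is an assumption of this sketch, not part of the regex above, and extract_rows is a hypothetical helper name.

```python
import re

SECTION = re.compile(
    r'(^\d+(?:\.\d+)*\s+)((?![a-z])[\s\S].*(?:\r?\n))([\s\S]*?)'
    r'(?=^\d+(?:\.\d+)*\s+(?![a-z])|\Z)',  # |\Z is an addition for the last section
    re.M,
)

def extract_rows(text):
    """Return (number, title, body) tuples with surrounding whitespace removed."""
    return [tuple(part.strip() for part in groups)
            for groups in SECTION.findall(text)]

sample = (
    "1 Scope\n"
    "This clause refers to\n"
    "4.7 and other clauses.\n"   # a reference, not a heading: lowercase after the number
    "2 Terms\n"
    "Definitions follow.\n"
)

for row in extract_rows(sample):
    print(row)
```

The line starting with 4.7 stays inside the body of section 1 because the character after the number and space is a lowercase letter.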
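import re
import pdfplumber

pages = []
with pdfplumber.open(r"sample.pdf") as pdf:
    for page in pdf.pages:
        # extract_text() can give None or an empty string for a page without text
        pages.append(page.extract_text() or "")
# Join pages with a newline so a heading at the top of a page still starts a line
pdfToString = "\n".join(pages)

matches = re.findall(r'^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*', pdfToString, re.M)
for block in matches:
    # Keep only blocks whose heading (first 50 characters) contains the wanted word
    if "word_to_extract" in block[:50]:
        print(block)
This solution extracts every block whose heading has the same format as the headings in the question, then prints only the blocks whose heading contains the required word, together with the paragraphs that follow it.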
Here is my pattern:
pattern_1a = re.compile(r"(?:```|\n)Item *1A\.?.{0,50}Risk Factors.*?(?:\n)Item *1B(?!u)", flags = re.I|re.S)
Why does it not match text like the following? What's wrong?
"""
Item 1A.
Risk
Factors
If we
are unable to commercialize
ADVEXIN
therapy in various markets for multiple indications,
particularly for the treatment of recurrent head and neck
cancer, our business will be harmed.
under which we may perform research and development services for
them in the future.
42
Table of Contents
We believe the foregoing transactions with insiders were and are
in our best interests and the best interests of our
stockholders. However, the transactions may cause conflicts of
interest with respect to those insiders.
Item 1B.
"""
Here is one solution that will match your actual text. Putting ( ) around the parts of the pattern turns them into capture groups, and \s+ between Risk and Factors lets the heading span a line break. See the solution below.
pattern_1a = re.compile(r"(?:```|\n)(Item 1A)[.\n]{0,50}(Risk\s+Factors)(.*?)(\nItem 1B\.)(?!u)", flags = re.I|re.S)
Match evidence:
https://regexr.com/41ejq
The problem is Risk Factors is spread over two lines. It is actually: Risk\nFactors
Using a general white space \s or a new line \n instead of a space matches the text.
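A minimal sketch of the difference, using the pattern from the question and a shortened version of its sample text (the names pattern_space, pattern_ws and text are made up for this illustration):

```python
import re

# Pattern from the question: a literal space between "Risk" and "Factors".
pattern_space = re.compile(
    r"(?:```|\n)Item *1A\.?.{0,50}Risk Factors.*?(?:\n)Item *1B(?!u)",
    flags=re.I | re.S,
)
# Same pattern with \s+ so the two words may be separated by a newline.
pattern_ws = re.compile(
    r"(?:```|\n)Item *1A\.?.{0,50}Risk\s+Factors.*?(?:\n)Item *1B(?!u)",
    flags=re.I | re.S,
)

text = (
    "\nItem 1A.\nRisk\nFactors\n"
    "If we are unable to commercialize ADVEXIN therapy, our business will be harmed.\n"
    "Item 1B.\n"
)

print(pattern_space.search(text))          # no match: there is no "Risk Factors" with a space
print(pattern_ws.search(text) is not None)  # the \s+ version matches
```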
simple example: func-tional --> functional
The story is that I got a Microsoft Word document that was converted from PDF, and some words remain hyphenated (such as func-tional, broken by a line break in the PDF). I want to recover those broken words while keeping normal hyphenated ones (i.e., where "-" is not a line-break hyphen).
In order to make it more clear, one long example (source text) is added:
After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance.
Could someone give me some suggestions on this problem?
I would use a regular expression. This little script finds hyphenated words and removes the hyphen. Note that it joins every word-word combination, so a word that is legitimately hyphenated loses its hyphen as well.
import re

def replaceHyphenated(s):
    matchList = re.findall(r"\w+-\w+", s)  # find every word-word combination
    sOut = s
    for m in matchList:
        new = m.replace("-", "")  # drop the hyphen
        sOut = sOut.replace(m, new)
    return sOut

if __name__ == "__main__":
    s = """After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance."""
    print(replaceHyphenated(s))
output would be:
After the symposium, the Foundation and the FCF steering team
continued their work and created the Functional Check Flight
Compendium. This compendium contains information that can be used to
reduce the risk of functional check flights. The information contained
in the guidance document is generic, and may need to be adjusted to
apply to your specific aircraft. If there are questions on any of the
information in the compendium, contact your manufacturer for further
guidance.
If you are not used to RegExp I recommend this site:
https://regex101.com/
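To keep genuinely hyphenated words, one possible variation is to join a pair only when the joined form is a known word. The sketch below is an assumption-laden illustration, not a complete solution: the function name dehyphenate is made up, and by default the "vocabulary" is just the set of words that already occur in the same text, which works for the example because functional and compendium also appear unbroken.

```python
import re

def dehyphenate(text, vocabulary=None):
    """Join word-word pairs only if the joined word is in the vocabulary."""
    if vocabulary is None:
        # Fallback assumption: every word already present in the text counts as known.
        vocabulary = {w.lower() for w in re.findall(r"[A-Za-z]+", text)}

    def join_if_known(match):
        joined = match.group(1) + match.group(2)
        # Keep the original hyphenated form when the joined word is unknown.
        return joined if joined.lower() in vocabulary else match.group(0)

    return re.sub(r"\b([A-Za-z]+)-([A-Za-z]+)\b", join_if_known, text)

sample = ("The Func-tional Check Flight is a functional check. "
          "A well-known compendi-um; this compendium helps.")
print(dehyphenate(sample))
```

A real word list passed as vocabulary would catch broken words that never appear unbroken elsewhere in the document.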
I have two CSV files: one with a large chunk of text and the other with annotations (strings). I want to find the position of each annotation in the text. The problem is that some of the annotations have extra spaces or characters that are not in the text. I cannot trim whitespace or characters from the original text, since I need the exact position. I started out using regex, but it seems there is no way to search for partial matches.
Example
text = ' K. Meney & L. Pantelic, Int. J. Sus. Dev. Plann. Vol. 10, No. 4 (2015) 544?561\n? 2015 WIT Press, www.witpress.com\nISSN: 1743-7601 (paper format), ISSN: 1743-761X (online), http://www.witpress.com/journals\nDOI: 10.2495/SDP-V10-N4-544-561\nNOVEL DECISION MODEL FOR DELIVERING SUSTAINABLE \nINFRASTRUCTURE SOLUTIONS ? AN AUSTRALIAN \nCASE STUDY\nK. MENEY & L. PANTELIC\nSyrinx Environmental PL, Australia.\nABSTRACT\nConventional approaches to water supply and wastewater treatment in regional towns globally are failing \ndue to population growth and resource pressure, combined with prohibitive costs of infrastructure upgrades. '
seg = 'water supply and wastewater ¿treatment'
m = re.search(seg, text, re.M | re.DOTALL | re.I)
This matches only about 15% of the segments.
m = re.match(r'(water).*(treatment)$', text, re.M)
This did not work. I thought it would be possible to match on the first and last words and get their positions, but this has numerous problems, such as multiple occurrences of 'water'.
import mmap
with open(file_path) as file, \
        mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as s:
    if s.find(seg) != -1:
        print('true')
I had no luck with this at all for some reason.
Am I on the right path with any of these or is there a better way to do this?
Extra Example
From Text
The SIDM? model was applied to a rapidly grow-\ning Australian township (Hopetoun)
From Seg
The SIDM model was applied to a rapidly grow-ing Australian township (Hopetoun)
From Text
\nSIDM? is intended to be used both as a design and evaluation tool. As a design tool, it i) guides \nthe design of sustainable infrastructure solutions, ii) can be used as a progress check to assess the \nlevel of completion of a project, iii) highlights gaps in the existing information sets, and iv) essen-\ntially provides the scope of work required to advance the design process. As an evaluation tool it can \nact both as a quick diagnostic tool, to check whether or not a solution has major flaws or is generally \nacceptable, and as a detailed evaluation tool where various options can be compared in detail in \norder to establish a preferred solution.
From Seg
SIDM is intended to be used both as a design and evaluation tool. As a design tool, it i) guides the design of sustainable infrastructure solutions, ii) can be used as a progress check to assess the level of completion of a project, iii) highlights gaps in the existing information sets, and iv) essen-tially provides the scope of work required to advance the design process. As an evaluation tool it can act both as a quick diagnostic tool, to check whether or not a solution has major flaws or is generally acceptable, and as a detailed evaluation tool where various options can be compared in detail in order to establish a preferred solution.
List of subs to segment prior to matching:
seg = re.sub(r'\(', r'\\(', seg )  # Need to escape parentheses because of regex
seg = re.sub(r'\)', r'\\)', seg )
seg = re.sub(r'\?', r' ', seg )
seg = re.sub(r'[^\x00-\x7F]+',' ', seg)
seg = re.sub(r'\s+', ' ', seg)
seg = re.sub(r'\\r', ' ', seg)
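As casimirethippolyte pointed out, patseg = re.sub(r'\W+', r'\\W+', seg) solved the problem for me: every run of non-word characters in the segment becomes \W+, so extra spaces, hyphens and stray characters no longer have to match the text exactly.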
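A minimal sketch of how that substitution gives the position in the untouched text. The helper name find_segment and the shortened sample strings are made up for this illustration; the replacement is written as a raw string with a doubled backslash so that re.sub emits a literal \W+.

```python
import re

def find_segment(seg, text):
    """Search for seg in text, letting any run of non-word characters differ."""
    # Turn every run of non-word characters in the annotation into \W+.
    patseg = re.sub(r'\W+', r'\\W+', seg.strip())
    return re.search(patseg, text, re.I)

text = ("Conventional approaches to water supply and wastewater treatment "
        "in regional towns. The model was applied to a rapidly grow-\ning "
        "Australian township (Hopetoun)")

m = find_segment('water supply and wastewater ¿treatment', text)
print(m.span(), repr(text[m.start():m.end()]))

m2 = find_segment('rapidly grow-ing Australian township', text)
print(m2.span(), repr(text[m2.start():m2.end()]))
```

Because the search runs against the original text, m.start() and m.end() are exact offsets into it. Note that a character present in the text but missing from the annotation (such as the ? after SIDM in the extra example) is only tolerated where the annotation already has a non-word run at that spot.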
I'm trying to extract the names of firms from the text.
I found that firm names start with capital letters and that some of them contain ' and ', ' de ', ' & ' or ' of '.
So I wrote this regular expression to catch them:
(?:[A-Z]+[\w'-]*\s?(?:&\s|and\s|de\s|of\s)?)+%?
For example, from the sentence
"The company's largest customer, Wal-Mart Stores, Inc. and its
affiliated companies, accounted for approximately 25% of net sales
during fiscal year 2009 and 24% during fiscal years 2008 and 2007."
this regex matches
"The", "Wal-Mart Stores", "Inc"
However, I am stuck with two problems.
Problem 1:
I found that the names of a company's segments, products, divisions, categories and sales lines are also matched, since they begin with capital letters too. However, I don't want to extract those names along with the company names.
Problem 2:
I don't want to get names that start with S(s)ale(s) of/by/in or sold.
For example,
;;;;;In fiscal 2005, the Company derived
approximately 21% ($4,782,852) of its consolidated revenues from
continuing operations from direct transactions with Kmart Corporation.
Sales of Computer products are important for us. However, Computer's Parts and
Display Segment sale has been decreasing.
Using the regex I wrote above, it extracts
['In', 'Company', 'Kmart Corporation', 'Sales of Computer', "Computer's Parts and Display Segment"]
I don't want to get 'Sales of Computer' or "Computer's Parts and Display Segment",
so I tried to use negative lookahead / lookbehind.
Below is what I've tried so far:
I added negative look ahead ((?![Ss]egments?|[Pp]roducts?|programs?|[Dd]ivisions?|[Cc]ategor(?:y|ies)|[Ss]ales?))
(?:[A-Z]+[\w'-]\s?(?:&\s|and\s|de\s|of\s)?)+(?![Ss]egments?|[Pp]roducts?|programs?|[Dd]ivisions?|[Cc]ategor(?:y|ies)|[Ss]ales?)*
However, it still matches "Computer's Parts and Display Segment"!
Negative lookbehind is even worse.
I added a negative lookbehind, (?<!...), at the beginning of my regex.
However, it seems that a negative lookbehind expression cannot contain grouping or | ...
Out of frustration, I wrote a few more regexes, one for each case, and used set operations to deal with this problem.
However, I wonder: is there a single regex that can do exactly what I expect in one shot?
Thanks for reading!
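For reference, the two-step workaround described above (match candidates first, then discard the unwanted ones) could look roughly like this. It is only a sketch under assumptions: the names CANDIDATE, UNWANTED and firm_candidates are made up, the keyword list is taken from the lookahead attempt in the question, and it is not the single regex being asked for.

```python
import re

# Candidate pattern from the question.
CANDIDATE = re.compile(r"(?:[A-Z]+[\w'-]*\s?(?:&\s|and\s|de\s|of\s)?)+%?")

# Candidates to discard: those starting with Sale(s) of/by/in or Sold,
# or containing one of the segment/product/division keywords.
UNWANTED = re.compile(
    r"^(?:sales?\s+(?:of|by|in)\b|sold\b)"
    r"|\b(?:segments?|products?|programs?|divisions?|categor(?:y|ies)|sales?)\b",
    re.I,
)

def firm_candidates(text):
    """Return capitalised candidates that do not look like segment or sales names."""
    names = (m.strip() for m in CANDIDATE.findall(text))
    return [n for n in names if n and not UNWANTED.search(n)]

text = ("In fiscal 2005, the Company derived approximately 21% ($4,782,852) "
        "of its consolidated revenues from continuing operations from direct "
        "transactions with Kmart Corporation. Sales of Computer products are "
        "important for us. However, Computer's Parts and Display Segment sale "
        "has been decreasing.")

print(firm_candidates(text))
```

Sentence-initial words such as In and However still come through, so a further filter would be needed for those.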