I need to extract headings and the chunk of text beneath them from a text file in Python using regular expressions, but I'm finding it difficult.
I converted this PDF to text so that it now looks like this:
So far I have been able to get all the numerical headers (12.4.5.4, 12.4.5.6, 13, 13.1, 13.1.1, 13.1.12) using the following regex:
import re
with open('data/single.txt', encoding='UTF-8') as file:
    for line in file:
        headings = re.findall(r'^\d+(?:\.\d+)*\.?', line)
        print(headings)
I just don't know how to get the worded part of those headings or the paragraph of text beneath them.
EDIT - Here is the text:
I.S. EN 60601-1:2006&A1:2013&AC:2014&A12:2014
60601-1 © IEC:2005
60601-1 © IEC:2005
– 337 –
– 169 –
12.4.5.4 Other ME EQUIPMENT producing diagnostic or therapeutic radiation
When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the
RISKS associated with ME EQUIPMENT producing diagnostic or therapeutic radiation other than
for diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3).
Compliance is checked by inspection of the RISK MANAGEMENT FILE.
12.4.6 Diagnostic or therapeutic acoustic pressure
When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the
RISKS associated with diagnostic or therapeutic acoustic pressure.
Compliance is checked by inspection of the RISK MANAGEMENT FILE.
13 * HAZARDOUS SITUATIONS and fault conditions
13.1 Specific HAZARDOUS SITUATIONS
General
13.1.1
When applying the SINGLE FAULT CONDITIONS as described in 4.7 and listed in 13.2, one at a
time, none of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive) shall occur in the
ME EQUIPMENT.
The failure of any one component at a time, which could result in a HAZARDOUS SITUATION, is
described in 4.7.
Emissions, deformation of ENCLOSURE or exceeding maximum temperature
13.1.2
The following HAZARDOUS SITUATIONS shall not occur:
– emission of flames, molten metal, poisonous or ignitable substance in hazardous
quantities;
– deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired;
–
temperatures of APPLIED PARTS exceeding the allowed values identified in Table 24 when
measured as described in 11.1.3;
temperatures of ME EQUIPMENT parts that are not APPLIED PARTS but are likely to be
touched, exceeding the allowable values in Table 23 when measured and adjusted as
described in 11.1.3;
–
– exceeding the allowable values for “other components and materials” identified in Table 22
times 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31.
In all other cases, the allowable values of Table 22 apply.
Temperatures shall be measured using the method described in 11.1.3.
The SINGLE FAULT CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to the emission of
flames, molten metal or ignitable substances, shall not be applied to parts and components
where:
– The construction or the supply circuit limits the power dissipation in SINGLE FAULT
CONDITION to less than 15 W or the energy dissipation to less than 900 J.
You could use your pattern and match a space after it followed by the rest of the line.
Then repeat matching all following lines that do not start with a heading.
^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*
^\d+(?:\.\d+)*  Your pattern to match a heading at the start of a line, followed by a space (use re.M so that ^ matches at every line start)
.* Match any char except a newline 0+ times
(?: Non capturing group
\r?\n Match a newline
(?! Negative lookahead, assert what is directly to the right is not
\d+(?:\.\d+)*  The heading pattern followed by a space
) Close lookahead
.* Match any char except a newline 0+ times
)* Close the non capturing group and repeat 0+ times to match all the lines
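A minimal sketch of how the pattern above might be used in Python. The capture groups, the `HEADING` and `SECTION` names, and the shortened sample text are additions for illustration and are not part of the original answer. Note that a heading number standing alone on a line with no trailing space (such as 13.1.1 in the sample) would not be matched by this pattern.

```python
import re

# Heading number such as 12.4.6 or 13.1 (same sub-pattern as in the answer).
HEADING = r'\d+(?:\.\d+)*'

# Group 1: heading number, group 2: rest of the heading line,
# group 3: all following lines that do not start with "<number> ".
SECTION = re.compile(rf'^({HEADING}) (.*)((?:\r?\n(?!{HEADING} ).*)*)', re.M)

# Shortened, hypothetical sample in the same layout as the question's text.
sample = (
    "– 169 –\n"
    "12.4.6 Diagnostic or therapeutic acoustic pressure\n"
    "When applicable, the MANUFACTURER shall address the RISKS.\n"
    "Compliance is checked by inspection.\n"
    "13.1 Specific HAZARDOUS SITUATIONS\n"
    "General\n"
)

sections = [
    (m.group(1), m.group(2).strip(), m.group(3).strip())
    for m in SECTION.finditer(sample)
]

for number, title, body in sections:
    print(number, '|', title, '|', body)
```

Each tuple holds the number, the worded heading and the body, which should be easier to post-process than one long matched string.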
Maybe
^(\d+(?:\.\d+)*)\s+([\s\S]*?)(?=^\d+(?:\.\d+)*)|^(\d+(?:\.\d+)*)\s+([\s\S]*)
comes close to extracting the text you want.
Here we simply look for lines that start with
^(\d+(?:\.\d+)*)\s+
then collect everything that follows using
([\s\S]*?)
up to the next line that starts with
(?=^\d+(?:\.\d+)*)
Depending on the input, one last section may be left over with no heading after it, which we collect using:
^(\d+(?:\.\d+)*)\s+([\s\S]*)
and join to the prior expression with an alternation (|).
Although this method is simple to code, it is fairly slow, because the lazy [\s\S]*? has to test the lookahead at every character. If running time is a concern, which is likely, the other answer here is the better choice.
Test
import re
regex = r"^(\d+(?:\.\d+)*)\s+([\s\S]*?)(?=^\d+(?:\.\d+)*)|^(\d+(?:\.\d+)*)\s+([\s\S]*)"
string = """
I.S. EN 60601-1:2006&A1:2013&AC:2014&A12:2014
60601-1 © IEC:2005
60601-1 © IEC:2005
– 337 –
– 169 –
12.4.5.4 Other ME EQUIPMENT producing diagnostic or therapeutic radiation
When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the
RISKS associated with ME EQUIPMENT producing diagnostic or therapeutic radiation other than
for diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3).
Compliance is checked by inspection of the RISK MANAGEMENT FILE.
12.4.6 Diagnostic or therapeutic acoustic pressure
When applicable, the MANUFACTURER shall address in the RISK MANAGEMENT PROCESS the
RISKS associated with diagnostic or therapeutic acoustic pressure.
Compliance is checked by inspection of the RISK MANAGEMENT FILE.
13 * HAZARDOUS SITUATIONS and fault conditions
13.1 Specific HAZARDOUS SITUATIONS
* General
13.1.1
When applying the SINGLE FAULT CONDITIONS as described in 4.7 and listed in 13.2, one at a
time, none of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive) shall occur in the
ME EQUIPMENT.
The failure of any one component at a time, which could result in a HAZARDOUS SITUATION, is
described in 4.7.
* Emissions, deformation of ENCLOSURE or exceeding maximum temperature
13.1.2
The following HAZARDOUS SITUATIONS shall not occur:
– emission of flames, molten metal, poisonous or ignitable substance in hazardous
quantities;
– deformation of ENCLOSURES to such an extent that compliance with 15.3.1 is impaired;
–
temperatures of APPLIED PARTS exceeding the allowed values identified in Table 24 when
measured as described in 11.1.3;
temperatures of ME EQUIPMENT parts that are not APPLIED PARTS but are likely to be
touched, exceeding the allowable values in Table 23 when measured and adjusted as
described in 11.1.3;
–
– exceeding the allowable values for “other components and materials” identified in Table 22
times 1,5 minus 12,5 °C. Limits for windings are found in Table 26, Table 27 and Table 31.
In all other cases, the allowable values of Table 22 apply.
Temperatures shall be measured using the method described in 11.1.3.
The SINGLE FAULT CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to the emission of
flames, molten metal or ignitable substances, shall not be applied to parts and components
where:
– The construction or the supply circuit limits the power dissipation in SINGLE FAULT
CONDITION to less than 15 W or the energy dissipation to less than 900 J.
"""
print(re.findall(regex, string, re.M))
Output
[('12.4.5.4', 'Other ME EQUIPMENT producing diagnostic or therapeutic
radiation \nWhen applicable, the MANUFACTURER shall address in
the RISK MANAGEMENT PROCESS the \nRISKS associated with ME
EQUIPMENT producing diagnostic or therapeutic radiation other than
\nfor diagnostic X-rays and radiotherapy (see 12.4.5.2 and 12.4.5.3).
\n\nCompliance is checked by inspection of the RISK MANAGEMENT
FILE.\n\n', '', ''), ('12.4.6', 'Diagnostic or therapeutic acoustic
pressure \nWhen applicable, the MANUFACTURER shall address in
the RISK MANAGEMENT PROCESS the \nRISKS associated with diagnostic
or therapeutic acoustic pressure. \n\nCompliance is checked by
inspection of the RISK MANAGEMENT FILE.\n\n', '', ''), ('13', '*
HAZARDOUS SITUATIONS and fault conditions\n\n', '', ''), ('13.1',
'Specific HAZARDOUS SITUATIONS\n\n* General \n\n', '', ''),
('13.1.1', 'When applying the SINGLE FAULT CONDITIONS as
described in 4.7 and listed in 13.2, one at a \ntime, none
of the HAZARDOUS SITUATIONS in 13.1.2 to 13.1.4 (inclusive)
shall occur in the \nME EQUIPMENT.\n\nThe failure of any one
component at a time, which could result in a HAZARDOUS SITUATION, is
\ndescribed in 4.7. \n\n* Emissions, deformation of ENCLOSURE or
exceeding maximum temperature \n\n', '', ''), ('', '', '13.1.2', 'The
following HAZARDOUS SITUATIONS shall not occur: \n– emission of
flames, molten metal, poisonous or ignitable substance in
hazardous \n\nquantities; \n\n– deformation of ENCLOSURES to such an
extent that compliance with 15.3.1 is impaired; \n– \n\ntemperatures
of APPLIED PARTS exceeding the allowed values identified in
Table 24 when \nmeasured as described in 11.1.3; \ntemperatures of
ME EQUIPMENT parts that are not APPLIED PARTS but are likely
to be \ntouched, exceeding the allowable values in Table 23
when measured and adjusted as \ndescribed in 11.1.3; \n\n– \n\n–
exceeding the allowable values for “other components and materials”
identified in Table 22 \ntimes 1,5 minus 12,5 °C. Limits for windings
are found in Table 26, Table 27 and Table 31. \nIn all other cases,
the allowable values of Table 22 apply. \n\nTemperatures shall be
measured using the method described in 11.1.3. \n\nThe SINGLE FAULT
CONDITIONS in 4.7, 8.1 b), 8.7.2 and 13.2.2, with regard to
the emission of \nflames, molten metal or ignitable substances,
shall not be applied to parts and components \nwhere: \n– The
construction or the supply circuit limits the power
dissipation in SINGLE FAULT \n\nCONDITION to less than 15 W or the
energy dissipation to less than 900 J. \n\n')]
Thanks to their detailed answers and helpful explanations, I ended up combining parts of both @The-fourth-bird's code and @Emma's code into this regex, which seems to work nicely for what I need.
(^\d+(?:\.\d+)*\s+)((?![a-z])[\s\S].*(?:\r?\n))([\s\S]*?)(?=^\d+(?:\.\d+)*\s+(?![a-z]))
It does what I want, which is splitting the (numerical heading), (worded heading) and (body of text) into groups separated by commas, which lets me separate them into columns in Excel by using the custom delimiter ), ( and some other post-processing.
The nice thing about this new regex is that it skips numbers that are just references in the body text and not actual headings, because the (?![a-z]) lookahead rejects a number that is followed by a lowercase word.
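The skipping behaviour can be checked with a small sketch. The sample text below is made up for illustration; only the regex itself comes from the post. The sketch also shows one limitation: the final lookahead needs a following heading, so the last section in the input is not returned.

```python
import re

# The combined regex from the post, split over two string literals.
COMBINED = re.compile(
    r'(^\d+(?:\.\d+)*\s+)((?![a-z])[\s\S].*(?:\r?\n))([\s\S]*?)'
    r'(?=^\d+(?:\.\d+)*\s+(?![a-z]))',
    re.M,
)

# Hypothetical sample: the third line starts with a reference, not a heading.
sample = (
    "12.4.6 Diagnostic pressure\n"
    "Body line, see\n"
    "12.4.5.2 and more.\n"
    "13 * HAZARDOUS SITUATIONS\n"
    "Last body\n"
)

matches = COMBINED.findall(sample)
for number, title, body in matches:
    print(repr(number), repr(title), repr(body))
```

The line starting with "12.4.5.2 and" stays inside the body of 12.4.6 because the number is followed by a lowercase word. Section 13 is missing from the result because no heading follows it.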
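import pdfplumber
import re

pdfToString = ""
with pdfplumber.open(r"sample.pdf") as pdf:
    for page in pdf.pages:
        # Extract each page once; fall back to "" if no text is returned
        page_text = page.extract_text() or ""
        print(page_text)
        pdfToString += page_text

# Each match is a heading line plus all following lines up to the next heading
matches = re.findall(r'^\d+(?:\.\d+)* .*(?:\r?\n(?!\d+(?:\.\d+)* ).*)*', pdfToString, re.M)
for i in matches:
    # Only look for the word near the start of the match, i.e. in the heading
    if "word_to_extract" in i[:50]:
        print(i)
This solution extracts every section whose heading has the same format as the headings in the question, then prints the section whose heading contains the required word, together with the paragraphs that follow it.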
Related
I am trying to make a regex pattern to grab part of a string. The file contains certain headers, and all of the headers have the same format. I'm currently using Python, and would like to keep it that way.
Here is an example file that I came across:
TI TEST TEST TEST TEST TEST TEST TEST TEST AJSAOISJAO SOAI
ASASPAOS
SO EITCHA EITCHA EITCHA EITCHA EITCHA EITCHA EITCHA EITCHA
AB Purpose
To examine the evidence supporting the use of simulation-based assessments as surrogates for patient-related outcomes assessed in the workplace.
Method
The authors systematically searched MEDLINE, EMBASE, Scopus, and key journals through February 26, 2013. They included original studies that assessed health professionals and trainees using simulation and then linked those scores with patient-related outcomes assessed in the workplace. Two reviewers independently extracted information on participants, tasks, validity evidence, study quality, patent-related and simulation-based outcomes, and magnitude of correlation. All correlations were pooled using random-effects meta-analysis.
Results
Of 11,628 potentially relevant articles, the 33 included studies enrolled 1,203 participants, including postgraduate physicians (n = 24 studies), practicing physicians (n = 8), medical students (n = 6), dentists (n = 2), and nurses (n = 1). The pooled correlation for provider behaviors was 0.51 (95% confidence interval [Cl], 0.38 to 0.62; n = 27 studies); for time behaviors, 0.44 (95% Cl, 0.15 to 0.66; n = 7); and for patient outcomes, 0.24(95% Cl, 0.02 to 0.47; n = 5). Most reported validity evidence was favorable, though studies often included only correlational evidence. Validity evidence of internal structure (n = 13 studies), content (n = 12), response process (n = 2), and consequences (n = 1) were reported less often. Three tools showed large pooled correlations and favorable (albeit incomplete) validity evidence.
Conclusions
Simulation-based assessments often correlate positively with patient-related outcomes. Although these surrogates are imperfect, tools with established validity evidence may replace workplace-based assessments for evaluating select procedural skills.
OI MANEIRAO MANEIRAOMANEIRAOMANEIRAO MANEIRAO
SN 6516516516
EI 849819981981
PD FEB
PY 2015
My current objective is to capture the entire text of the 'AB' header. Note that the length and format of the contents of AB don't change much; it's pretty much always paragraphs, or a line of text, until the next header.
I've tried a bunch of different regex patterns; the one that got me closest to what I want is:
\nAB ((.*?\n)+)(\n[A-Z]{2}\s)?
However, it goes until the end of the file, consuming every header it finds. I would like the pattern to stop matching when it encounters the next header after AB, whatever it may be.
The headers always follow the pattern of a line break, then two uppercase letters and a space, or:
\n[A-Z]{2}\s
Thanks to whomever helps in any way.
My question is different from the usual greedy vs. non-greedy questions because the stopping point is not a single character matched non-greedily but an entire "stop" group.
Is this what you're looking for?
^AB ([\w\W]*?)(?=\n[A-Z]{2}\s)
(?=...) is a positive lookahead. It asserts that the given subpattern can be matched here, without consuming characters. Note that ^ only matches at the start of every line when the re.M flag is set.
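A minimal sketch of this pattern applied in Python, assuming a shortened, made-up record in the same layout as the question's file. The variable names are illustrative.

```python
import re

# Hypothetical, shortened record with the same two-letter headers.
record = (
    "TI Some title\n"
    "AB Purpose\n"
    "To examine the evidence.\n"
    "Method\n"
    "The authors searched MEDLINE.\n"
    "OI 0000\n"
    "SN 6516516516\n"
)

# re.M lets ^ match at the start of the AB line, not only at the start of the string.
match = re.search(r'^AB ([\w\W]*?)(?=\n[A-Z]{2}\s)', record, re.M)
abstract = match.group(1) if match else None
print(abstract)
```

Two caveats follow from how the lookahead is written: a body line that itself starts with two uppercase letters and whitespace would end the match early, and an AB block that is the last header in the file has no following header to stop at, so nothing would be matched.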
Here is my pattern:
pattern_1a = re.compile(r"(?:```|\n)Item *1A\.?.{0,50}Risk Factors.*?(?:\n)Item *1B(?!u)", flags = re.I|re.S)
Why does it not match text like the following? What's wrong?
"""
Item 1A.
Risk
Factors
If we
are unable to commercialize
ADVEXIN
therapy in various markets for multiple indications,
particularly for the treatment of recurrent head and neck
cancer, our business will be harmed.
under which we may perform research and development services for
them in the future.
42
Table of Contents
We believe the foregoing transactions with insiders were and are
in our best interests and the best interests of our
stockholders. However, the transactions may cause conflicts of
interest with respect to those insiders.
Item 1B.
"""
Here is one solution that will match your actual text. Putting ( ) around the parts of your pattern solves a lot of issues. See the solution below.
pattern_1a = re.compile(r"(?:```|\n)(Item 1A)[.\n]{0,50}(Risk Factors)([\n]|.)*(\nItem 1B.)(?!u)", flags = re.I|re.S)
Match evidence:
https://regexr.com/41ejq
The problem is that Risk Factors is spread over two lines. It is actually: Risk\nFactors
Using general whitespace \s or a newline \n instead of a literal space matches the text.
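A small sketch of that change, using a shortened version of the question's text. The only change to the asker's pattern is `Risk\s+Factors` in place of `Risk Factors`; the three backticks are written as `` `{3} `` here, which is equivalent.

```python
import re

# Shortened version of the text from the question.
text = (
    "\nItem 1A.\nRisk\nFactors\nIf we\nare unable to commercialize\n"
    "ADVEXIN\ntherapy in various markets.\nItem 1B.\n"
)

# Original pattern: the literal space cannot match the line break in "Risk\nFactors".
original = re.compile(
    r"(?:`{3}|\n)Item *1A\.?.{0,50}Risk Factors.*?(?:\n)Item *1B(?!u)",
    flags=re.I | re.S,
)

# Adjusted pattern: \s+ accepts a space, a newline, or several of either.
pattern_1a = re.compile(
    r"(?:`{3}|\n)Item *1A\.?.{0,50}Risk\s+Factors.*?(?:\n)Item *1B(?!u)",
    flags=re.I | re.S,
)

print(original.search(text))          # no match
match = pattern_1a.search(text)
print(match.group(0) if match else None)
```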
I've got a CSV which contains the text of articles in different rows.
For example, column 1 contains:
Hello i am John
Tom has got a Dog
... more text.
I'm trying to extract the first names and surnames from those texts, and I was able to do that when I copy and paste a single text into the code.
But I don't know how to read the CSV in the code so that it processes the different texts in the rows, extracting the first name and surname from each.
Here is my code working with the text in it:
import operator,collections,heapq
import csv
import pandas
import json
import nltk
from nameparser.parser import HumanName

def get_human_names(text):
    tokens = nltk.tokenize.word_tokenize(text)
    pos = nltk.pos_tag(tokens)
    sentt = nltk.ne_chunk(pos, binary = False)
    person_list = []
    person = []
    name = ""
    for subtree in sentt.subtrees(filter=lambda t: t.label() == 'PERSON'):
        for leaf in subtree.leaves():
            person.append(leaf[0])
        if len(person) > 1: #avoid grabbing lone surnames
            for part in person:
                name += part + ' '
            if name[:-1] not in person_list:
                person_list.append(name[:-1])
            name = ''
        person = []
    return (person_list)
text = """
M.F. Husain, Untitled, 1973, oil on canvas, 182 x 122 cm. Courtesy the Pundole Family Collection
In her essay ‘Worlding Asia: A Conceptual Framework for the First Delhi Biennale’, Arshiya Lokhandwala explores Gayatri Spivak’s provocation of ‘worlding’, which has been defined as imperialism’s epistemic violence of inscribing meaning upon a colonized space to bring it into the world through a Eurocentric framework. Lokhandwala extends this concept of worlding to two anti-cartographical terms: ‘de-worlding’, rejecting or debunking categories that are no longer useful such as the binaries of East-West, North-South, Orient-Occidental, and ‘re-worlding’, re-inscribing new meanings into the spaces that have been de-worlded to create one’s own worlds. She offers de-worlding and re-worlding as strategies for active resistance against epistemic violence of all forms, including those that stem from ‘colonialist strategies of imperialism’ or from ‘globalization disguised within neo-imperialist practices’.
Lokhandwala writes: Fourth World. The presence of Arshiya is really the main thing here.
Re-worlding allows us to reach a space of unease performing the uncanny, thereby locating both the object of art and the postcolonial subject in the liminal space, which prevents these categorizations as such… It allows an introspected view of ourselves and makes us seek our own connections, and look at ourselves through our own eyes.
In a recent exhibition on the occasion of the seventieth anniversary of India’s Independence, Lokhandwala employed the term to seemingly interrogate this proposition: what does it mean to re-world a country through the agonistic intervention of art and activism? What does it mean for a country and its historiography to re-world? What does this re-worlded India, in active resistance and a state of introspection, look like to itself?
The exhibition ‘India Re-Worlded: Seventy Years of Investigating a Nation’ at Gallery Odyssey in Mumbai (11 September 2017–21 February 2018) invited artists to select a year from the seventy years since the country’s independence that had personal import or resonated with them because of the significance of the events that occurred at the time. The show featured works that responded to or engaged with these chosen years. It captured a unique history of post-independent India told through the perspective of seventy artists. The works came together to collectively reflect on the history and persistence of violence from pre-independence to the present day and made reference to the continued struggle for political agency through acts of resistance, artistic and otherwise. Through the inclusion of subaltern voices, imagined geographies, particular experiences, solidarities and critical dissent, the exhibition offered counter-narratives and multiple histories.
Anita Dube, Missing Since 1992, 2017, wood, electrical wire, holders, bulbs, voltage stabilizers, 223 x 223 cm. Courtesy the artist and Gallery Odyssey
Lokhandwala says she had been thinking hard about an appropriate response to the seventy years of independence. ‘I wanted to present a new curatorial paradigm, a postcolonial critique of the colonisation and an affirmation of India coming into her own’, she says. ‘I think the fact that I tried to include seventy artists to [each take up] one year in the lifetime of the nation was also a challenging task to take on curatorially.’
Her previous undertaking ‘After Midnight: Indian Modernism to Contemporary India: 1947/1997’ at the Queens Museum in New York in 2015 juxtaposed two historical periods in Indian art: Indian modern art that emerged in the post-independence period from 1947 through the 1970s, and contemporary art from 1997 onwards when the country experienced the effects of economic liberalization and globalization. The 'India Re-Worlded' exhibition similarly presented art practices that emerged from the framework of postcolonial Indian modernity. It attempted to explore the self-reflexivity of the Indian artist as a postcolonial subject and, as Lokhandwala described in the curatorial note, the artists’ resulting ‘sense of agency and renewed connection with the world at large’. The exhibition included works by Progressive Artists' Group core members F.N. Souza, S.H. Raza, M.F. Husain and their peers Krishen Khanna, Tyeb Mehta and V.S. Gaitonde, presented under the year in which they were produced. Other important and pioneering pieces included work from Somnath Hore’s paper pulp print series Wounds (1970); a blowtorch on plywood work by abstractionist Jeram Patel, who was one of the founding members of Group 1890 ; and a video documenting one of Rummana Husain’s last performances.
The methodology of their display removed the didactic, art historical preoccupation with chronology and classification, instead opting to intersperse them amongst contemporary works. This fits in with Lokhandwala’s curatorial impulses and vision: to disrupt and resist single narratives, to stage dialogues and interactions between the works, to offer overlaps, intersections and nuances in the stories, but also in the artistic impetuses.
Jeram Patel, Untitled, 1970, blowtorch Fourht World on plywood, 61 x 61 cm. Courtesy the artist and Gallery Odyssey
The show opened with Jitish Kallat’s Death of Distance (2006), then we have Arshiya, which through lenticular prints presented two overlaid found texts from 2005 and 2006. One was a harrowing news story of a twelve-year-old Indian girl committing suicide after her mother tells her she cannot afford one rupee – two US cents – for a school meal. The other one was a news clipping in which the head of the state-run telecommunications company announces a new one-rupee-per-minute tariff plan for interstate phone calls and declares the scheme as ‘the death of distance’. The images offer two realities that are distant from and at odds with each other. They highlight an economic disparity heightened by globalization. A rupee coin, enlarged to a human scale and covered in black lead, stood poised on the gallery floor in front of the prints.
Bose Krishnamachari chose 1962, the year of his birth, to discuss the relationship between memory and age. As a visual representation of the country’s past through a timeline, within which he situated his own identity-questioning experiences as an artist, his work epitomized the themes and intentions of the exhibition. In Shilpa Gupta’s single channel video projection 100 Hand drawn Maps of India (2007–8) ordinary Indian people sketch outlines of the country from memory. The subjective maps based on the author’s impression and perception of space show how each person sees the country and articulates its borders. The work seems to ask, what do these incongruent representations reveal about our collective identities and our ideas about nationhood?
The repetition of some of the years selected, or even the absence of certain years, suggested that the parameters set by the curatorial concept sought to guide rather than clamp down on. This allowed greater freedom for the artists and curator, and therefore more considered and wide responses.
Surekha’s photographic series To Embrace (2017) celebrated the Chipko tree-hugging movement that originated on 25 March 1974, when 27 women from Reni village in Uttar Pradesh in northern India staged a self-organised, non-violent resistance to the felling of trees by clinging to them and linking arms around them. The photographs showed women embracing the branches of the giant, 400-year-old Dodda Alada Mara (Big Banyan Tree) in rural Bengaluru – paying a homage to both the pioneering eco-feminist environmental movement and the grand old tree.
Anita Dube’s Missing Since 1992 (2017) hung from the ceiling like a ghost of a terrible, dark past. Its electrical wires and bulbs outlined a sombre dome to represent the demolition of the Babri Masjid on 6 December 1992, which Dube calls ‘the darkest day I have experienced as a citizen’. This piece was one of several works in the exhibition that dealt with this event and the many episodes of communal riots that followed. These works document a decade when the country witnessed economic reform and growth but also the rise of a religious right-wing.
Riyas Komu, Fourth World, 2017, rubber and metal, 244 x 45 cm each. Courtesy the artist and Gallery Odyssey
Near the end of the exhibition, Riyas Komu’s sculptural installation Fourth World (2017) alerted us to the divisive forces that are threatening to dismantle the ethical foundations of the Republic symbolized by its official emblem, the Lion Capital – a symbol seen also on the blackened rupee coin featured in Kallat’s work – and in a way rounded off the viewing experience.
The seventy works that attempted to represent seventy years of the country’s history built a dense and complicated network of voices and stories, and also formed a cross section of the art emerging during this period. Although the show’s juxtaposition of modern and contemporary art made it seem like an extension of the themes presented in the curator’s previous exhibition at the Queens Museum, here the curatorial concept made the process of staging the exhibition more democratic blurring the sequence of modern and contemporary Indian art. Furthermore, the multi-pronged curatorial intentions brought renewed criticality to the events of past and present, always underscoring the spirit of resistance and renegotiation as the viewer could actively de-world and re-world.
"""
names = get_human_names(text)
print ("LAST, FIRST")
namex=[]
for name in names:
    last_first = HumanName(name).last + ' ' + HumanName(name).first
    print (last_first)
    namex.append(last_first)
print (namex)
print('Saving the data to the json file named Names')
try:
    with open('Names.json', 'w') as outfile:
        json.dump(namex, outfile)
except Exception as e:
    print(e)
So I would like to remove all the text from the code and have the code process the text from my CSV instead.
Thanks a lot :)
CSV stands for Comma Separated Values and is a text format used to represent tabular data in plain text. Commas are used as column separators and line breaks as row separators. Your string does not look like a real CSV file. Regardless of the extension, you can still read your text file like this:
with open('your_file.csv', 'r') as f:
    my_text = f.read()
Your text file is now available as my_text in the rest of your code.
Pandas has a read_csv function, which returns a DataFrame:
yourText = pandas.read_csv("csvFile.csv")
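If the file really is a CSV with one article per row, a row-by-row loop may be closer to what the question asks for. The sketch below uses the standard-library csv module rather than pandas, and it makes several assumptions: the text is in the first column, `names_per_row` is a hypothetical helper, and `fake_extract` is only a stand-in so the example runs without NLTK. In real use, `get_human_names` from the question would be passed instead.

```python
import csv
import io

def names_per_row(csv_file, extract, column=0):
    """Apply `extract` to one column of every non-empty row of an open CSV file."""
    results = []
    for row in csv.reader(csv_file):
        if len(row) > column and row[column].strip():
            results.append(extract(row[column]))
    return results

# Stand-in extractor (hypothetical); replace with get_human_names in real use.
def fake_extract(text):
    return [word for word in text.split() if word in {"John", "Tom"}]

# In-memory stand-in for the CSV file, using the rows from the question.
demo = io.StringIO('"Hello i am John"\n"Tom has got a Dog"\n')
names = names_per_row(demo, fake_extract)
print(names)
```

With a real file, the `io.StringIO` object would be replaced by something like `open('articles.csv', newline='', encoding='utf-8')`, where the file name is a placeholder.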
I have two csv's. One with a large chunk of text and the other with annotations/strings. I want to find the position of the annotation in the text. The problem is some of the annotations have extra space/characters that are not in the text. I can not trim white space/ characters from the original text since I need the exact position. I started out using regex but it seems there is no way to search for partial matches.
Example
text = ' K. Meney & L. Pantelic, Int. J. Sus. Dev. Plann. Vol. 10, No. 4 (2015) 544?561\n? 2015 WIT Press, www.witpress.com\nISSN: 1743-7601 (paper format), ISSN: 1743-761X (online), http://www.witpress.com/journals\nDOI: 10.2495/SDP-V10-N4-544-561\nNOVEL DECISION MODEL FOR DELIVERING SUSTAINABLE \nINFRASTRUCTURE SOLUTIONS ? AN AUSTRALIAN \nCASE STUDY\nK. MENEY & L. PANTELIC\nSyrinx Environmental PL, Australia.\nABSTRACT\nConventional approaches to water supply and wastewater treatment in regional towns globally are failing \ndue to population growth and resource pressure, combined with prohibitive costs of infrastructure upgrades. '
seg = 'water supply and wastewater ¿treatment'
m = re.search(seg, text, re.M | re.DOTALL | re.I)
This matches only about 15% of the segs.
m = re.match(r'(water).*(treatment)$', text, re.M)
This did not work. I thought it would be possible to match on the first and last words and get their positions, but this has numerous problems, such as multiple occurrences of 'water'.
with open(file_path) as file, \
        mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as s:
    if s.find(seg) != -1:
        print('true')
I had no luck with this at all for some reason.
Am I on the right path with any of these or is there a better way to do this?
Extra Example
From Text
The SIDM? model was applied to a rapidly grow-\ning Australian township (Hopetoun)
From Seg
The SIDM model was applied to a rapidly grow-ing Australian township (Hopetoun)
From Text
\nSIDM? is intended to be used both as a design and evaluation tool. As a design tool, it i) guides \nthe design of sustainable infrastructure solutions, ii) can be used as a progress check to assess the \nlevel of completion of a project, iii) highlights gaps in the existing information sets, and iv) essen-\ntially provides the scope of work required to advance the design process. As an evaluation tool it can \nact both as a quick diagnostic tool, to check whether or not a solution has major flaws or is generally \nacceptable, and as a detailed evaluation tool where various options can be compared in detail in \norder to establish a preferred solution.
From Seg
SIDM is intended to be used both as a design and evaluation tool. As a design tool, it i) guides the design of sustainable infrastructure solutions, ii) can be used as a progress check to assess the level of completion of a project, iii) highlights gaps in the existing information sets, and iv) essen-tially provides the scope of work required to advance the design process. As an evaluation tool it can act both as a quick diagnostic tool, to check whether or not a solution has major flaws or is generally acceptable, and as a detailed evaluation tool where various options can be compared in detail in order to establish a preferred solution.
List of subs applied to the segment prior to matching:
seg = re.sub(r'\(', r'\\(', seg ) # Need to escape parentheses because they are regex metacharacters
seg = re.sub(r'\)', r'\\)', seg )
seg = re.sub(r'\?', r' ', seg )
seg = re.sub(r'[^\x00-\x7F]+',' ', seg)
seg = re.sub(r'\s+', ' ', seg)
seg = re.sub(r'\\r', ' ', seg)
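As casimirethippolyte pointed out, patseg = re.sub(r'\W+', r'\\W+', seg) solved the problem for me. It replaces every run of non-word characters in the segment with the pattern \W+, so spaces, line breaks and stray characters no longer have to match exactly. (The replacement is written as r'\\W+' because recent Python versions reject an unescaped \W in a replacement string.)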
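A minimal sketch of the same idea, written as a variant: instead of substituting inside the segment, it joins the segment's words with `\W+`, which also avoids having to escape parentheses. The `loose_pattern` helper and the shortened text are assumptions for illustration.

```python
import re

def loose_pattern(seg):
    """Build a pattern matching the words of `seg` with any run of
    non-word characters (spaces, newlines, '?', '-') between them."""
    words = re.findall(r'\w+', seg)
    return re.compile(r'\W+'.join(words), re.I)

# Shortened text and segment based on the "Extra Example" in the question.
text = ('Intro.\nThe SIDM? model was applied to a rapidly grow-\n'
        'ing Australian township (Hopetoun) in 2015.')
seg = 'The SIDM model was applied to a rapidly grow-ing Australian township (Hopetoun)'

m = loose_pattern(seg).search(text)
start, end = m.span()
print(start, end, repr(text[start:end]))
```

The span refers to positions in the untouched original text, so the text never has to be trimmed. Leading and trailing punctuation of the segment, such as the closing parenthesis, is not part of the match.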
I tried this code:
import re
re.sub('\r\n\r\n','','Summary_csv.csv')
It did not do anything. As in, it did not even touch the file (there is no modification to the date and time of the file after running this code). Could anyone please explain why?
Then I tried this:
import re
output = open("Summary.csv","w", encoding="utf8")
input = open("Summary_csv.csv", encoding="utf8")
for line in input:
    output.write(re.sub('\r\n\r\n','', line))
input.close()
output.close()
This one does something to the file, as in the modified date and time of the file change after I run this code, but it does not remove the consecutive newlines, and the output is the same as the original file.
EDIT: This is a small sample from the original csv file:
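Two things may explain the behaviour described above. In the first attempt, re.sub receives the string 'Summary_csv.csv' itself, not the contents of the file. In the second, each `line` holds at most one line break, so a pattern that needs two consecutive breaks can never match; in addition, text mode converts '\r\n' to '\n' on reading by default. A minimal sketch of an alternative, assuming the whole text is processed at once and that a space is the wanted replacement (`join_paragraphs` is a hypothetical helper):

```python
import re

def join_paragraphs(text):
    """Replace each run of two or more line breaks with a single space."""
    return re.sub(r'(?:\r?\n){2,}', ' ', text)

# Made-up sample: two paragraphs in one quoted field, then a second record.
sample = '"First paragraph.\r\n\r\nSecond paragraph."\r\n"Next record."\r\n'
joined = join_paragraphs(sample)
print(repr(joined))
```

For a real file, the text could be read in one go with `open(path, encoding="utf8", newline="").read()`, where `newline=""` keeps the original '\r\n' sequences. This sketch leaves single line breaks alone, so a break directly before a closing quote would need separate handling.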
"The UK’s Civil Aviation Authority (CAA) has announced new passenger charge caps for Heathrow and Gatwick while deregulating Stansted. Under the Civil Aviation Act 2012 for the economic regulation of UK airport operators, the CAA conducts market power assessments (MPA) to judge their power within the aviation market and whether they need to be regulated. (....) As expected, the CAA’s price review published on January 10 requires Heathrow and Gatwick to continue their regulated status, though Stansted has been de-regulated, giving operator MAG the power to determine what levies are necessary.
Although the CAA had previously said Heathrow would be allowed to increase its charges in line with inflation, Heathrow and Gatwick’s price rises will be limited to 1.5% below the rate of inflation from April 1. These rules will run until December 31, 2018, for Heathrow and until March 31, 2021 for Gatwick. (....) CAA's Chair, Dame Deidre Hutton commented: “[Passengers] will see prices fall, whilst still being able to look forward to high service standards, thanks to a robust licensing regime.” Heathrow has stated the CAA’s price caps will result in its per passenger airline charges falling in real terms from £20.71 in 2013/14 to £19.10 in 2018/19. (....)
"
"The CAPA Airport Construction and Capex database presently has over USD385 billion of projects indicated globally, led by Asia with just over USD115 billion of projects either in progress or planned for and with a good chance of completion. China, with 69 regional airports to be constructed by 2015, is the most active, adding to the existing 193. But some Asian countries, notably India and Indonesia, each with extended near-or more than double digit growth, are lagging badly in introducing new infrastructure.
The Middle East is also undertaking major investment, notably in the Gulf airports, as the world-changing operations of its main airlines continue to expand rapidly. But Saudi Arabia and Oman are also embarked on major expansions.
Istanbul's new airport starts to take shape in 2014, with completion of the world's biggest facility due to be completed by 2019. Meanwhile, in Brazil, the race is on to have sufficient capacity in place for the football world cup, due to commence in Jun-2014. (....)
"
I want the output to be the following:
"The UK’s Civil Aviation Authority (CAA) has announced new passenger charge caps for Heathrow and Gatwick while deregulating Stansted. Under the Civil Aviation Act 2012 for the economic regulation of UK airport operators, the CAA conducts market power assessments (MPA) to judge their power within the aviation market and whether they need to be regulated. (....) As expected, the CAA’s price review published on January 10 requires Heathrow and Gatwick to continue their regulated status, though Stansted has been de-regulated, giving operator MAG the power to determine what levies are necessary. Although the CAA had previously said Heathrow would be allowed to increase its charges in line with inflation, Heathrow and Gatwick’s price rises will be limited to 1.5% below the rate of inflation from April 1. These rules will run until December 31, 2018, for Heathrow and until March 31, 2021 for Gatwick. (....) CAA's Chair, Dame Deidre Hutton commented: “[Passengers] will see prices fall, whilst still being able to look forward to high service standards, thanks to a robust licensing regime.” Heathrow has stated the CAA’s price caps will result in its per passenger airline charges falling in real terms from £20.71 in 2013/14 to £19.10 in 2018/19. (....)"
"The CAPA Airport Construction and Capex database presently has over USD385 billion of projects indicated globally, led by Asia with just over USD115 billion of projects either in progress or planned for and with a good chance of completion. China, with 69 regional airports to be constructed by 2015, is the most active, adding to the existing 193. But some Asian countries, notably India and Indonesia, each with extended near-or more than double digit growth, are lagging badly in introducing new infrastructure.The Middle East is also undertaking major investment, notably in the Gulf airports, as the world-changing operations of its main airlines continue to expand rapidly. But Saudi Arabia and Oman are also embarked on major expansions.Istanbul's new airport starts to take shape in 2014, with completion of the world's biggest facility due to be completed by 2019. Meanwhile, in Brazil, the race is on to have sufficient capacity in place for the football world cup, due to commence in Jun-2014. (....)"
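The first snippet does nothing because re.sub is being applied to the string 'Summary_csv.csv', not to the file of that name. It expects a string for the third argument and does the substitution on that string, so it finds no match and returns the filename unchanged without ever opening the file.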
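A quick way to check this for yourself, as a minimal sketch (the variable names and the sample string are made up for illustration):

```python
import re

# re.sub only sees the literal string it is given; it never opens a file.
result = re.sub('\r\n\r\n', '', 'Summary_csv.csv')
print(result)  # Summary_csv.csv (unchanged, the pattern is not in the name)

# Given real text that contains the pattern, the same call does substitute.
sample = 'first\r\n\r\nsecond'
cleaned = re.sub('\r\n\r\n', '', sample)
print(repr(cleaned))  # 'firstsecond'
```

Note that re.sub returns a new string in both cases; nothing is written anywhere unless you write the result out yourself.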
In the second piece of code, you open the file and read it one line at a time. This means that no line will ever contain two newlines. Two newlines will result in two consecutive lines being returned from the input file with the second line being empty.
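The line-by-line behaviour can be seen with a small in-memory example. Here io.StringIO stands in for the opened file, which is an assumption made purely for illustration:

```python
import io

# io.StringIO behaves like a text file opened for reading.
fake_file = io.StringIO('para one\n\npara two\n')

# Iterating yields one line at a time, each ending at the first newline.
lines = list(fake_file)
print(lines)  # ['para one\n', '\n', 'para two\n']
```

The blank line shows up as its own item, '\n', so a pattern that needs two newlines in a row can never match any single item.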
To get rid of the extra new lines, just test for a blank line and don't write it to the output. Calling line.strip() on an empty line (one containing only whitespace characters) will return an empty string which will evaluate to False in an if statement. If line.strip() isn't empty, then write it to your output file.
output = open("Summary.csv","w", encoding="utf8")
infile = open("Summary_csv.csv", encoding="utf8")
for line in infile:
    if line.strip():
        output.write(line)
infile.close()
output.close()
Note: Python treats text files in a platform-independent way and by default converts every line ending to '\n' when reading, so testing for '\r\n' wouldn't work even without the other problems. If you really need to see the original '\r\n' endings, pass newline='' or newline='\r\n' to open() for the input file. See the documentation at https://docs.python.org/3/library/functions.html#open for a full explanation.
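The effect of the newline argument can be checked with a throwaway file. This is a minimal sketch; the temporary file and its contents are made up for the demonstration:

```python
import os
import tempfile

# Write raw bytes so the file really contains Windows-style '\r\n' endings.
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(b'one\r\n\r\ntwo\r\n')

# Default text mode: every line ending arrives as '\n'.
with open(path, encoding='utf8') as f:
    translated = f.read()

# newline='' switches the translation off, so '\r\n' is left as it is.
with open(path, encoding='utf8', newline='') as f:
    untranslated = f.read()

os.remove(path)
print(repr(translated))    # 'one\n\ntwo\n'
print(repr(untranslated))  # 'one\r\n\r\ntwo\r\n'
```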
Part II
With the example input and output files posted by the OP, it appears that the problem was more complex than stripping extra newlines. The following code reads the input file, finds text between pairs of " characters and combines all of the lines onto a single line in the output file. Extra newlines not inside " are sent to the output file unaltered.
import re
outfile = open("Summary.csv","w", encoding="utf8")
infile = open("Summary_csv.csv", encoding="utf8")
text = infile.read()
text = re.sub('\n\n', '\n', text) #remove double newlines
for p in re.split('(\".+?\")', text, flags=re.DOTALL):
    if p: #skip empty matches
        if p.strip(): #this is a paragraph of text and should be a line
            p = p[1:-1] #get everything between the quotes
            p = p.strip() #remove leading and trailing whitespace
            p = re.sub('\n+', ' ', p) #replace any remaining \n with a single space
            p = '"' + p + '"\n' #put the " back around the paragraph and add newline
        outfile.write(p)
infile.close()
outfile.close()
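For experimenting without touching any files, the same idea can be written as a function on a string. This is only a sketch of an alternative, not the code above: it uses re.sub with a callback instead of re.split, the name join_quoted_paragraphs is made up, and it assumes the text never contains a " character inside a quoted block.

```python
import re

def join_quoted_paragraphs(text):
    """Collapse each "..." block in text onto a single line."""
    def collapse(match):
        inner = match.group(1).strip()            # drop whitespace next to the quotes
        inner = re.sub(r'\s*\n+\s*', ' ', inner)  # each run of newlines becomes one space
        return '"' + inner + '"'
    # DOTALL lets . match newlines, so one block can span several lines.
    return re.sub(r'"(.+?)"', collapse, text, flags=re.DOTALL)

sample = '"first line\nsecond line\n"\n"third line\n\nfourth line\n"\n'
joined = join_quoted_paragraphs(sample)
print(joined)
```

Text outside the quotes, including the newline between records, passes through untouched.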