I have two CSVs: one with a large chunk of text and the other with annotations/strings. I want to find the position of each annotation in the text. The problem is that some of the annotations have extra spaces/characters that are not in the text. I cannot trim whitespace/characters from the original text since I need the exact position. I started out using regex, but it seems there is no way to search for partial matches.
Example
text = ' K. Meney & L. Pantelic, Int. J. Sus. Dev. Plann. Vol. 10, No. 4 (2015) 544?561\n? 2015 WIT Press, www.witpress.com\nISSN: 1743-7601 (paper format), ISSN: 1743-761X (online), http://www.witpress.com/journals\nDOI: 10.2495/SDP-V10-N4-544-561\nNOVEL DECISION MODEL FOR DELIVERING SUSTAINABLE \nINFRASTRUCTURE SOLUTIONS ? AN AUSTRALIAN \nCASE STUDY\nK. MENEY & L. PANTELIC\nSyrinx Environmental PL, Australia.\nABSTRACT\nConventional approaches to water supply and wastewater treatment in regional towns globally are failing \ndue to population growth and resource pressure, combined with prohibitive costs of infrastructure upgrades. '
seg = 'water supply and wastewater ¿treatment'
m = re.search(seg, text, re.M | re.DOTALL | re.I)
this matches on only about 15% of segs
m = re.match(r'(water).*(treatment)$', text, re.M)
This did not work. I thought it would be possible to match on the first and last words and get their positions, but this has numerous problems, such as multiple occurrences of 'water'.
import mmap

with open(file_path) as file, \
        mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as s:
    if s.find(seg.encode()) != -1:  # mmap.find expects bytes, not str
        print('true')
I had no luck with this at all for some reason.
Am I on the right path with any of these or is there a better way to do this?
Extra Example
From Text
The SIDM? model was applied to a rapidly grow-\ning Australian township (Hopetoun)
From Seg
The SIDM model was applied to a rapidly grow-ing Australian township (Hopetoun)
From Text
\nSIDM? is intended to be used both as a design and evaluation tool. As a design tool, it i) guides \nthe design of sustainable infrastructure solutions, ii) can be used as a progress check to assess the \nlevel of completion of a project, iii) highlights gaps in the existing information sets, and iv) essen-\ntially provides the scope of work required to advance the design process. As an evaluation tool it can \nact both as a quick diagnostic tool, to check whether or not a solution has major flaws or is generally \nacceptable, and as a detailed evaluation tool where various options can be compared in detail in \norder to establish a preferred solution.
From Seg
SIDM is intended to be used both as a design and evaluation tool. As a design tool, it i) guides the design of sustainable infrastructure solutions, ii) can be used as a progress check to assess the level of completion of a project, iii) highlights gaps in the existing information sets, and iv) essen-tially provides the scope of work required to advance the design process. As an evaluation tool it can act both as a quick diagnostic tool, to check whether or not a solution has major flaws or is generally acceptable, and as a detailed evaluation tool where various options can be compared in detail in order to establish a preferred solution.
List of substitutions applied to the segment prior to matching:
seg = re.sub(r'\(', r'\\(', seg)  # need to escape parentheses for the regex
seg = re.sub(r'\)', r'\\)', seg)
seg = re.sub(r'\?', ' ', seg)  # drop stray '?' characters
seg = re.sub(r'[^\x00-\x7F]+', ' ', seg)  # replace non-ASCII characters with a space
seg = re.sub(r'\s+', ' ', seg)  # collapse runs of whitespace
seg = re.sub(r'\\r', ' ', seg)  # remove literal '\r' sequences
As casimirethippolyte pointed out, patseg = re.sub(r'\W+', r'\\W+', seg) solved the problem for me (the replacement needs the doubled backslash so that a literal \W+ is inserted into the pattern).
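For reference, a minimal sketch of that approach end to end (assuming text and seg as above; the .strip() is an addition so the pattern does not begin or end with \W+):

import re

# Collapse every run of non-word characters in the segment into \W+ so stray
# spaces, hyphens, or odd symbols no longer break the match. All regex
# metacharacters are non-word characters, so they are neutralized as well.
patseg = re.sub(r'\W+', r'\\W+', seg.strip())
m = re.search(patseg, text, re.I)
if m:
    print(m.start(), m.end())  # exact positions in the untouched text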
Related
I'm trying to clean the following data:
from sklearn import datasets
data = datasets.fetch_20newsgroups(categories=['rec.autos', 'rec.sport.baseball', 'soc.religion.christian'])
texts, targets = data['data'], data['target']
Where texts is a list of articles and targets is a vector containing the index of the category to which each article belongs.
I need to clean all articles. The cleaning task means:
Remove headers
Remove punctuation
Remove parenthesis
Remove consecutive blank spaces
Remove emails and tokens with length 1
Remove line breaks
I'm quite new to Python, but I've tried to remove all punctuation and everything using replace(). However, I think there must be an easier way to do this task.
def clean_articles(article):
    # Drop the header (before the first blank line), strip punctuation, collapse spaces
    body = article[article.find('\n\n'):]
    return ' '.join(body.replace('.', '').replace('[', '').split())

clean_articles(data['data'][1])
For the following article:
print(data['data'][1])
Uncleaned Article:
'From: aas7#po.CWRU.Edu (Andrew A. Spencer)\nSubject: Re: Too fast\nOrganization: Case Western Reserve University, Cleveland, OH (USA)\nLines: 25\nReply-To: aas7#po.CWRU.Edu (Andrew A. Spencer)\nNNTP-Posting-Host: slc5.ins.cwru.edu\n\n\nIn a previous article, wrat#unisql.UUCP (wharfie) says:\n\n>In article <1qkon8$3re#armory.centerline.com> jimf#centerline.com (Jim Frost) writes:\n>>larger engine. That\'s what the SHO is -- a slightly modified family\n>>sedan with a powerful engine. They didn\'t even bother improving the\n>>brakes.\n>\n>\tThat shows how much you know about anything. The brakes on the\n>SHO are very different - 9 inch (or 9.5? I forget) discs all around,\n>vented in front. The normal Taurus setup is (smaller) discs front, \n>drums rear.\n\none i saw had vented rears too...it was on a lot.\nof course, the sales man was a fool..."titanium wheels"..yeah, right..\nthen later told me they were "magnesium"..more believable, but still\ncrap, since Al is so m uch cheaper, and just as good....\n\n\ni tend to agree, tho that this still doesn\'t take the SHO up to "standard"\nfor running 130 on a regular basis. The brakes should be bigger, like\n11" or so...take a look at the ones on the Corrados.(where they have\nbraking regulations).\n\nDREW\n'
Cleaned Article:
In previous article UUCP wharfie says In article centerline com com Jim Frost writes larger engine That's what the SHO is slightly modified family sedan with powerful engine They didn't even bother improving the *brakes That shows how much you know about anything The brakes on the SHO are very different inch or forget discs all around vented in front The normal Taurus setup is smaller discs front drums rear one saw had vented rears too it was on lot of course the sales man was fool titanium wheels yeah right then later told me they were magnesium more believable but still crap since Al is so uch cheaper and just as good tend to agree tho that this still doesn't take the SHO up to standard for running 130 on regular basis The brakes should be bigger like 11 or so take look at the ones on the Corrados where they have braking regulations DREW
note: this is not a complete answer, but the following will at least get you halfway to:
remove punctuation
remove line breaks
remove consecutive white space
remove parentheses
import re

s = ';\n(a b.,'
print('before:', s)
s = re.sub(r'[.,;\n(){}\[\]]', '', s)  # strip punctuation, brackets, and newlines
s = re.sub(r'\s+', ' ', s)  # collapse consecutive whitespace
print('after:', s)
this will print:
before: ;
(a b.,
after: a b
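To cover the remaining requirements (header removal, emails, short tokens), here is one possible way to put it together; a sketch, with a deliberately crude email pattern:

import re

def clean_article(article):
    # Drop the header block: 20newsgroups articles separate headers
    # from the body with the first blank line
    body = article.split('\n\n', 1)[-1]
    body = re.sub(r'\S+@\S+', '', body)            # crude email removal
    body = re.sub(r'[.,;:!?(){}\[\]]', '', body)   # punctuation and parentheses
    # Splitting and re-joining collapses line breaks and consecutive spaces;
    # the length filter drops 1-character tokens
    return ' '.join(tok for tok in body.split() if len(tok) > 1)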
Here is my pattern:
pattern_1a = re.compile(r"(?:^|\n)Item *1A\.?.{0,50}Risk Factors.*?(?:\n)Item *1B(?!u)", flags = re.I|re.S)
Why does it not match text like the following? What's wrong?
"""
Item 1A.
Risk
Factors
If we
are unable to commercialize
ADVEXIN
therapy in various markets for multiple indications,
particularly for the treatment of recurrent head and neck
cancer, our business will be harmed.
under which we may perform research and development services for
them in the future.
42
Table of Contents
We believe the foregoing transactions with insiders were and are
in our best interests and the best interests of our
stockholders. However, the transactions may cause conflicts of
interest with respect to those insiders.
Item 1B.
"""
Here is one solution that will match your actual text. Putting ( ) around your strings will solve a lot of issues. See the solution below.
pattern_1a = re.compile(r"(?:^|\n)(Item 1A)[.\n]{0,50}(Risk Factors)([\n]|.)*(\nItem 1B.)(?!u)", flags = re.I|re.S)
Match evidence:
https://regexr.com/41ejq
The problem is that Risk Factors is spread over two lines; it is actually Risk\nFactors.
Using a general whitespace class \s (or a newline \n) instead of the literal space makes the pattern match the text.
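For instance, a corrected pattern along those lines (a sketch of the fix, not the poster's exact final pattern) would be:

import re

pattern_1a = re.compile(
    r"(?:^|\n)Item\s*1A\.?.{0,50}Risk\s+Factors.*?\nItem\s*1B(?!u)",
    flags=re.I | re.S,
)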
simple example: func-tional --> functional
The story is that I got a Microsoft Word document, converted from PDF format, in which some words remain hyphenated (such as func-tional, broken because of a line break in the PDF). I want to recover those broken words while keeping normal ones (i.e., where the "-" is not a line-break artifact).
To make it clearer, one long example (source text) is added:
After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance.
Could someone give me some suggestions on this problem?
I would use a regular expression. This little script searches for hyphenated words and removes the hyphen.
import re

def replaceHyphenated(s):
    matchList = re.findall(r"\w+-\w+", s)  # find word-word combinations
    sOut = s
    for m in matchList:
        new = m.replace("-", "")  # drop the hyphen
        sOut = sOut.replace(m, new)
    return sOut

if __name__ == "__main__":
    s = """After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance."""
    print(replaceHyphenated(s))
output would be:
After the symposium, the Foundation and the FCF steering team
continued their work and created the Functional Check Flight
Compendium. This compendium contains information that can be used to
reduce the risk of functional check flights. The information contained
in the guidance document is generic, and may need to be adjusted to
apply to your specific aircraft. If there are questions on any of the
information in the compendium, contact your manufacturer for further
guidance.
If you are not used to RegExp I recommend this site:
https://regex101.com/
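One caveat with this approach: it joins every hyphenated pair, so legitimate compounds like well-known would be collapsed too. A possible refinement (a sketch; WORDS is a hypothetical word list you would need to load yourself) is to join only when the merged form is a known word:

import re

WORDS = set()  # placeholder: load from a dictionary file of your choice

def replace_hyphenated_checked(s):
    def join_if_word(m):
        merged = m.group(0).replace("-", "")
        # keep the hyphen unless the joined form is a known word
        return merged if merged.lower() in WORDS else m.group(0)
    return re.sub(r"\w+-\w+", join_if_word, s)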
u'The disclosure relates to systems and methods for detecting features on\n billets of laminated veneer lumber (LVL). In some embodiments, an LVL\n billet is provided and passed through a scanning assembly. The scanning\n assembly includes anx-raygenerator and anx-raydetector. Thex-raygenerator generates a beam ofx-rayradiation and thex-raydetector\n measures intensity of the beam ofx-rayradiation after is passes through\n the LVL billet. The measured intensity is then processed to create an\n image. Images taken according to the disclosure may then be analyzed todetectfeatures on the LVL billet.'
Above is my output.
Now I want to get rid of the "\n" characters in Python.
How can I do this? Should I use the re module?
I use text to refer to all of the above, and text.strip("\n") has no effect at all.
Why?
Thank you!
For a string, s, doing:
s = s.strip("\n")
will only remove the leading and trailing newline characters.
What you want is
s = s.replace("\n", "")
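A quick demonstration of the difference:

s = "\nfirst line\nsecond line\n"
print(repr(s.strip("\n")))        # 'first line\nsecond line' -- only the ends are trimmed
print(repr(s.replace("\n", "")))  # 'first linesecond line'   -- every newline is removed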
Have you tried the replace function?
s = u'The disclosure relates to systems and methods for detecting features on\n billets of laminated veneer lumber (LVL). In some embodiments, an LVL\n billet is provided and passed through a scanning assembly. The scanning\n assembly includes anx-raygenerator and anx-raydetector. Thex-raygenerator generates a beam ofx-rayradiation and thex-raydetector\n measures intensity of the beam ofx-rayradiation after is passes through\n the LVL billet. The measured intensity is then processed to create an\n image. Images taken according to the disclosure may then be analyzed todetectfeatures on the LVL billet.'
s = s.replace('\n', '')  # replace returns a new string, so reassign the result
try this:
''.join(a.split('\n'))
where a is the output string
# avoid calling the variable str: that shadows the built-in type
s = 'The disclosure relates to systems and methods for detecting features on\n billets of laminated veneer lumber (LVL).'
s.replace('\n', '')  # option 1: drop newlines
' '.join(s.split())  # option 2: also collapses runs of whitespace
Output
The disclosure relates to systems and methods for detecting features on billets of laminated veneer lumber (LVL).
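Note that the split()/join() variant also collapses runs of spaces and tabs, not just newlines:

s = 'a\n  b\t\tc'
print(' '.join(s.split()))  # prints: a b c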
I use this code to split data into a list with three sublists: splitting when there is a * or a -. But it also keeps the \n\n * sequences, and I don't know why.
I don't want those included. Can someone tell me what I'm doing wrong?
This is the data:
*Quote of the Day
-Education is the ability to listen to almost anything without losing your temper or your self-confidence - Robert Frost
-Education is what survives when what has been learned has been forgotten - B. F. Skinner
*Fact of the Day
-Fractals, an important part of chaos theory, are very useful in studying a huge amount of areas. They are present throughout nature, and so can be used to help predict many things in nature. They can also help simulate nature, as in graphics design for movies (animating clouds etc), or predict the actions of nature.
-According to a recent survey by Just-Eat, not everyone in The United Kingdom actually knows what the Scottish delicacy, haggis is. Of the 1,623 British people polled:\n\n * 18% of Brits thought haggis was some sort of Scottish animal.\n\n * 15% thought it was a Scottish musical instrument.\n\n * 4% thought it was a character from Harry Potter.\n\n * 41% didn't even know what Scotland's national dish was.\n\nWhile a small number of Scots admitted not knowing what haggis was either, they also discovered that 68% of Scots would like to see Haggis delivered as takeaway.
-With the growing concerns involving Facebook and its ever changing privacy settings, a few software developers have now engineered a website that allows users to trawl through the status updates of anyone who does not have the correct privacy settings to prevent it.\n\nNamed Openbook, the ultimate aim of the site is to further expose the problems with Facebook and its privacy settings to the general public, and show people just how easy it is to access this type of information about complete strangers. The site works as a search engine so it is easy to search terms such as 'don't tell anyone' or 'I hate my boss', and searches can also be narrowed down by gender.
*Pet of the Day
-Scottish Terrier
-Land Shark
-Hamster
-Tse Tse Fly
END
i use this code:
contents = open("data.dat").read()
data = contents.split('*') #split the data at the '*'
newlist = [item.split("-") for item in data if item]
but the result is not the list I am trying to get.
The "\n\n" is part of the input data, so it's preserved in python. Just add a strip() to remove it:
finallist = [item.strip() for item in newlist]
See the strip() docs: http://docs.python.org/library/stdtypes.html#str.strip
UPDATED FROM COMMENT:
finallist = [[field.replace("\\n", "\n").strip() for field in item] for item in newlist]
open("data.dat").read() - reads all symbols in file, not only those you want.
If you don't need '\n' you can try content.replace("\n",""), or read lines (not whole content), and truncate the last symbol'\n' of each line.
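A minimal sketch of the line-by-line variant:

with open("data.dat") as f:
    lines = [line.rstrip("\n") for line in f]  # drop the trailing newline from each line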
This is going to split any asterisk you have in the text as well.
A better implementation would be something like this:
lines = []
for line in open("data.dat"):
    if line.lstrip().startswith("*"):   # lstrip is a method, so it must be called
        lines.append([line.strip()])    # start a new sublist with this line
    elif line.lstrip().startswith("-"):
        lines[-1].append(line.strip())
For more homework, research what's happening when you use the open() function in this way.
The following solves your problem, I believe:
result = [[subitem.replace(r'\n\n', '\n') for subitem in item.split('\n-')]
          for item in open('data.txt').read().split('\n*')]

# now let's pretty-print the result
for i in result:
    print('***', i[0], '***')
    for j in i[1:]:
        print('\t--', j)
    print()
Note that I split on newline + * or -, so it won't split on dashes inside the text. I also replace the literal character sequence \n\n (r'\n\n') with a real newline character '\n'. The one-liner expression is a list comprehension: a way to construct lists in one gulp, without multiple .append() calls or + concatenation.