u'The disclosure relates to systems and methods for detecting features on\n billets of laminated veneer lumber (LVL). In some embodiments, an LVL\n billet is provided and passed through a scanning assembly. The scanning\n assembly includes anx-raygenerator and anx-raydetector. Thex-raygenerator generates a beam ofx-rayradiation and thex-raydetector\n measures intensity of the beam ofx-rayradiation after is passes through\n the LVL billet. The measured intensity is then processed to create an\n image. Images taken according to the disclosure may then be analyzed todetectfeatures on the LVL billet.'
Above is my output.
Now I want to get rid of the "\n" characters in Python.
How can I do this?
Should I use the re module?
I stored all of the above text in a variable named text, but text.strip("\n") has no effect at all.
Why?
Thank you!
For a string, s, doing:
s = s.strip("\n")
will only remove the leading and trailing newline characters.
What you want is
s = s.replace("\n", "")
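To make the difference concrete, here is a small sketch on a shortened, made-up sample string (not the full text from the question). It compares strip, replace and a split/join that also collapses the leftover whitespace:

```python
# Shortened sample with newlines at the edges and in the middle.
s = "\nfeatures on\n billets of LVL.\n"

# strip only touches the two ends of the string.
stripped = s.strip("\n")

# replace removes every occurrence, including the one in the middle.
replaced = s.replace("\n", "")

# split/join treats any run of whitespace as a single separator.
collapsed = " ".join(s.split())

print(repr(stripped))
print(repr(replaced))
print(repr(collapsed))
```

In this sample each "\n" inside the text is followed by a space, so replace alone already leaves single spaces between words.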
Have you tried the replace function?
s = u'The disclosure relates to systems and methods for detecting features on\n billets of laminated veneer lumber (LVL). In some embodiments, an LVL\n billet is provided and passed through a scanning assembly. The scanning\n assembly includes anx-raygenerator and anx-raydetector. Thex-raygenerator generates a beam ofx-rayradiation and thex-raydetector\n measures intensity of the beam ofx-rayradiation after is passes through\n the LVL billet. The measured intensity is then processed to create an\n image. Images taken according to the disclosure may then be analyzed todetectfeatures on the LVL billet.'
s.replace('\n', '')
try this
''.join(a.split('\n'))
a is the output string
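text = 'The disclosure relates to systems and methods for detecting features on\n billets of laminated veneer lumber (LVL).'
text = text.replace('\n', '')  # remove the newline characters (the name text avoids shadowing the built-in str)
' '.join(text.split())  # collapse any remaining runs of whitespace into single spaces
Output
The disclosure relates to systems and methods for detecting features on billets of laminated veneer lumber (LVL).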
I have a list of titles that I need to normalize. For example, if a title contains 'CTO', it needs to be changed to 'Chief Technology Officer'. However, I only want to replace 'CTO' if there is no letter directly to the left or right of 'CTO'. For example, 'Director' contains 'cto'. I obviously wouldn't want this to be replaced. However, I do want it to be replaced in situations where the title is 'Founder/CTO' or 'CTO/Founder'.
Is there a way to check if a letter is before 'CXO' using regex? Or what would be the best way to accomplish this task?
EDIT:
My code is as follows...
test = 'Co-Founder/CTO'
test = re.sub("[^a-zA-Z0-9]CTO", 'Chief Technology Officer', test)
The result is 'Co-FounderChief Technology Officer'. The '/' gets replaced for some reason. However, this doesn't happen if test = 'CTO/Co-Founder'.
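Your pattern matches one non-alphanumeric character followed by CTO, so that character (the '/' here) is part of the match and gets replaced too. What you want is a regex that requires a non-alphanumeric character before CTO: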
"[^a-zA-Z0-9]CTO"
But you actually also need to check for when CTO occurs at the beginning of the line:
"^CTO"
To use the first expression within re.sub, you can add a grouping operator (()s) and then use it in the replacement to pull out the matching character (eg, space or /):
re.sub("([^a-zA-Z0-9])CTO","\\1Chief Technology Officer", "foo/CTO")
Will result in
'foo/Chief Technology Officer'
Answer: "(?<=[^a-zA-Z0-9])CTO|^CTO"
Lookbehinds are perfect for this
cto_re = re.compile("(?<=[^a-zA-Z0-9])CTO")
but unfortunately it won't match at the start of a line, because Python's re module requires lookbehinds to have a fixed width, so the lookbehind cannot also contain the zero-width ^.
for eg in "Co-Founder/CTO", "CTO/Bossy", "aCTOrMan":
print(cto_re.sub("Chief Technology Officer", eg))
Co-Founder/Chief Technology Officer
CTO/Bossy
aCTOrMan
You would have to check for that explicitly via |:
cto_re = re.compile("(?<=[^a-zA-Z0-9])CTO|^CTO")
for eg in "Co-Founder/CTO", "CTO/Bossy", "aCTOrMan":
print(cto_re.sub("Chief Technology Officer", eg))
Co-Founder/Chief Technology Officer
Chief Technology Officer/Bossy
aCTOrMan
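An alternative worth trying is a pair of negative lookarounds, which say "no letter directly before and no letter directly after" and therefore also succeed at the start or end of the string. This is only a sketch: the helper name expand_cto is made up, and it checks for letters only, as the question describes, rather than letters and digits.

```python
import re

# CTO with no ASCII letter immediately before or after it.
# A negative lookbehind also succeeds at the start of the string,
# so no separate ^ alternative is needed.
cto_re = re.compile(r"(?<![a-zA-Z])CTO(?![a-zA-Z])")

def expand_cto(title):
    """Hypothetical helper: expand a standalone 'CTO' in a job title."""
    return cto_re.sub("Chief Technology Officer", title)

for eg in "Co-Founder/CTO", "CTO/Bossy", "aCTOrMan", "Director":
    print(expand_cto(eg))
```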
Here is my pattern:
pattern_1a = re.compile(r"(?:```|\n)Item *1A\.?.{0,50}Risk Factors.*?(?:\n)Item *1B(?!u)", flags = re.I|re.S)
Why does it not match text like the following? What is wrong?
"""
Item 1A.
Risk
Factors
If we
are unable to commercialize
ADVEXIN
therapy in various markets for multiple indications,
particularly for the treatment of recurrent head and neck
cancer, our business will be harmed.
under which we may perform research and development services for
them in the future.
42
Table of Contents
We believe the foregoing transactions with insiders were and are
in our best interests and the best interests of our
stockholders. However, the transactions may cause conflicts of
interest with respect to those insiders.
Item 1B.
"""
Here is one solution that will match your actual text. Putting ( ) around the parts of your pattern solves a lot of issues. See the solution below.
pattern_1a = re.compile(r"(?:```|\n)(Item 1A)[.\n]{0,50}(Risk Factors)([\n]|.)*(\nItem 1B.)(?!u)", flags = re.I|re.S)
Match evidence:
https://regexr.com/41ejq
The problem is Risk Factors is spread over two lines. It is actually: Risk\nFactors
Using a general white space \s or a new line \n instead of a space matches the text.
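A minimal sketch of that fix, run against a shortened version of the sample text. Note one assumption: the start of the pattern is written here as (?:^|\n), which may differ from the first alternative in the question's pattern.

```python
import re

# Shortened version of the sample text; "Risk" and "Factors" sit on separate lines.
text = "\nItem 1A.\nRisk\nFactors\nIf we\nare unable to commercialize\nADVEXIN\nItem 1B.\n"

# Literal space between "Risk" and "Factors": cannot match "Risk\nFactors".
literal_space = re.compile(
    r"(?:^|\n)Item *1A\.?.{0,50}Risk Factors.*?\nItem *1B(?!u)",
    flags=re.I | re.S,
)

# \s+ accepts any whitespace, including the newline between the two words.
any_whitespace = re.compile(
    r"(?:^|\n)Item\s*1A\.?.{0,50}Risk\s+Factors.*?\nItem\s*1B(?!u)",
    flags=re.I | re.S,
)

no_match = literal_space.search(text)
match = any_whitespace.search(text)
print(no_match)
print(match is not None)
```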
simple example: func-tional --> functional
The story is that I have a Microsoft Word document that was converted from PDF, and some words remain hyphenated (such as func-tional, broken by a line break in the PDF). I want to rejoin those broken words while keeping normal hyphenated words (i.e., where "-" is not a line-break hyphen) unchanged.
In order to make it more clear, one long example (source text) is added:
After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance.
Could someone give me some suggestions on this problem?
I would use a regular expression. This little script searches for hyphenated words and removes the hyphen from each of them.
import re

def replaceHyphenated(s):
    matchList = re.findall(r"\w+-\w+", s)  # find every word-word combination
    sOut = s
    for m in matchList:
        new = m.replace("-", "")  # drop the hyphen from the matched word
        sOut = sOut.replace(m, new)
    return sOut

if __name__ == "__main__":
    s = """After the symposium, the Foundation and the FCF steering team continued their work and created the Func-tional Check Flight Compendium. This compendium contains information that can be used to reduce the risk of functional check flights. The information contained in the guidance document is generic, and may need to be adjusted to apply to your specific aircraft. If there are questions on any of the information in the compendi-um, contact your manufacturer for further guidance."""
    print(replaceHyphenated(s))
output would be:
After the symposium, the Foundation and the FCF steering team
continued their work and created the Functional Check Flight
Compendium. This compendium contains information that can be used to
reduce the risk of functional check flights. The information contained
in the guidance document is generic, and may need to be adjusted to
apply to your specific aircraft. If there are questions on any of the
information in the compendium, contact your manufacturer for further
guidance.
If you are not used to RegExp I recommend this site:
https://regex101.com/
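Note that the pattern \w+-\w+ also matches words that are legitimately hyphenated. One possible refinement, sketched below, is to join the two halves only when the joined word appears in a list of known words. The function name join_broken_words and the tiny vocabulary set are made up for illustration; in practice you would need to supply a real word list.

```python
import re

def join_broken_words(text, vocabulary):
    """Join 'left-right' into 'leftright' only if the result is a known word."""
    def fix(match):
        joined = match.group(1) + match.group(2)
        # Keep the original hyphenated form when the joined word is unknown.
        return joined if joined.lower() in vocabulary else match.group(0)
    return re.sub(r"(\w+)-(\w+)", fix, text)

# Hypothetical vocabulary; a real one would come from a dictionary file.
vocabulary = {"functional", "compendium"}
sample = "the Func-tional Check Flight Compendium, a well-known compendi-um"
result = join_broken_words(sample, vocabulary)
print(result)
```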
I have two csv's. One with a large chunk of text and the other with annotations/strings. I want to find the position of the annotation in the text. The problem is some of the annotations have extra space/characters that are not in the text. I can not trim white space/ characters from the original text since I need the exact position. I started out using regex but it seems there is no way to search for partial matches.
Example
text = ' K. Meney & L. Pantelic, Int. J. Sus. Dev. Plann. Vol. 10, No. 4 (2015) 544?561\n? 2015 WIT Press, www.witpress.com\nISSN: 1743-7601 (paper format), ISSN: 1743-761X (online), http://www.witpress.com/journals\nDOI: 10.2495/SDP-V10-N4-544-561\nNOVEL DECISION MODEL FOR DELIVERING SUSTAINABLE \nINFRASTRUCTURE SOLUTIONS ? AN AUSTRALIAN \nCASE STUDY\nK. MENEY & L. PANTELIC\nSyrinx Environmental PL, Australia.\nABSTRACT\nConventional approaches to water supply and wastewater treatment in regional towns globally are failing \ndue to population growth and resource pressure, combined with prohibitive costs of infrastructure upgrades. '
seg = 'water supply and wastewater ¿treatment'
m = re.search(seg, text, re.M | re.DOTALL | re.I)
This matches only about 15% of the segments.
m = re.match(r'(water).*(treatment)$', text, re.M)
This did not work. I thought it would be possible to match on the first and last words and get their positions, but this has numerous problems, such as multiple occurrences of 'water'.
with open(file_path) as file, \
        mmap.mmap(file.fileno(), 0, access=mmap.ACCESS_READ) as s:
    if s.find(seg) != -1:
        print('true')
I had no luck with this at all for some reason.
Am I on the right path with any of these or is there a better way to do this?
Extra Example
From Text
The SIDM? model was applied to a rapidly grow-\ning Australian township (Hopetoun)
From Seg
The SIDM model was applied to a rapidly grow-ing Australian township (Hopetoun)
From Text
\nSIDM? is intended to be used both as a design and evaluation tool. As a design tool, it i) guides \nthe design of sustainable infrastructure solutions, ii) can be used as a progress check to assess the \nlevel of completion of a project, iii) highlights gaps in the existing information sets, and iv) essen-\ntially provides the scope of work required to advance the design process. As an evaluation tool it can \nact both as a quick diagnostic tool, to check whether or not a solution has major flaws or is generally \nacceptable, and as a detailed evaluation tool where various options can be compared in detail in \norder to establish a preferred solution.
From Seg
SIDM is intended to be used both as a design and evaluation tool. As a design tool, it i) guides the design of sustainable infrastructure solutions, ii) can be used as a progress check to assess the level of completion of a project, iii) highlights gaps in the existing information sets, and iv) essen-tially provides the scope of work required to advance the design process. As an evaluation tool it can act both as a quick diagnostic tool, to check whether or not a solution has major flaws or is generally acceptable, and as a detailed evaluation tool where various options can be compared in detail in order to establish a preferred solution.
List of subs to segment prior to matching:
seg = re.sub(r'\(', r'\\(', seg ) # Need to escape parentheses because they are special in a regex
seg = re.sub(r'\)', r'\\)', seg )
seg = re.sub(r'\?', r' ', seg )
seg = re.sub(r'[^\x00-\x7F]+',' ', seg)
seg = re.sub(r'\s+', ' ', seg)
seg = re.sub(r'\\r', ' ', seg)
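As casimirethippolyte pointed out, patseg = re.sub(r'\W+', r'\\W+', seg) solved the problem for me: it replaces every run of non-word characters in the segment with the pattern \W+, so differences in punctuation and whitespace no longer block the match, and the match positions still refer to the untouched text.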
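For anyone landing here later, this is roughly how that idea can be wrapped up to get the positions. It is a sketch of the same technique rather than the exact code used: the function name find_span is made up, and it builds the pattern from the words of the segment instead of substituting in place, which avoids a stray \W+ at either end.

```python
import re

def find_span(seg, text):
    """Return (start, end) of seg in text, ignoring non-word characters, or None."""
    words = re.findall(r"\w+", seg)  # keep only the word parts of the segment
    if not words:
        return None
    # Allow any run of non-word characters between consecutive words.
    pattern = r"\W+".join(words)
    m = re.search(pattern, text, re.I)
    return (m.start(), m.end()) if m else None

text = "The SIDM? model was applied to a rapidly grow-\ning Australian township (Hopetoun)"
seg = "The SIDM model was applied to a rapidly grow-ing Australian township (Hopetoun)"
span = find_span(seg, text)
print(span)
```

The returned offsets index into the original text, so nothing has to be trimmed from it.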
I am trying to run blastn through biopython with NCBIWWW.
I am using the qblast function on a given sample file.
I have a few methods defined and everything works like a charm when my fasta contains sequences that are long enough. The only case where it fails is when I need to blast reads from Illumina sequencing that are too short. So I would say it is probably because the blast parameters are not automatically adjusted for short queries when the job is submitted.
I tried everything I could to come close to blastn-short conditions (see table C2 from here) without any success.
It looks like I am not able to pass in the correct parameters.
The closest I think I came to a working call is the following:
result_handle = NCBIWWW.qblast("blastn", "nr",
fastaSequence,
word_size=7,
gapcosts='5 2',
nucl_reward=1,
nucl_penalty='-3',
expect=1000)
Thank you for any tip / advice to make it work.
My sample fasta read is the following one :
>TEST 1-211670
AGACTGCGATCCGAACTGAGAAC
The error that I get is the following one :
>ValueError: Error message from NCBI: Message ID#24 Error: Failed to read the Blast query: Protein FASTA provided for nucleotide sequence
And when I look at this page, it seems that my problem is about fixing the threshold but obviously I didn't manage to make it work so far.
Thank you for any help.
I once had problems blasting peptides, and it turned out to be a matter of selecting the right parameters. It took me a terribly long time to find out what they should be (the data on various websites is inconsistent and scarce, and the NCBI documentation is quite convoluted on this point). I know you are interested in blasting nucleotide sequences, but you may find your solution by looking at the code below. Pay special attention to the parameters filter, composition_based_statistics, word_size and matrix_name. In my case they turned out to be crucial.
blast_handle = NCBIWWW.qblast("blastp", "nr",
peptide_seq,
expect=200000.0,
filter=False,
word_size=2,
composition_based_statistics=False,
matrix_name="PAM30",
gapcosts="9 1",
hitlist_size=1000)
This code works for me (Biopython 1.64):
from Bio.Blast import NCBIWWW
fastaSequence = ">TEST 1-211670\nAGACTGCGATCCGAACTGAGAAC"
result_handle = NCBIWWW.qblast("blastn", "nr",
fastaSequence,
word_size=7,
gapcosts='5 2',
nucl_reward=1,
nucl_penalty='-3',
expect=1000)
print(result_handle.read())
Maybe you are passing a wrong fastaSequence. Biopython doesn't make any transformation from SeqRecords (or anything else) to plain FASTA. You have to provide the query as shown above.
BLAST decides whether a query is a nucleotide or a protein sequence by reading its first few characters. If the share of those characters that are in "ACGT" is above a threshold, the query is treated as a nucleotide sequence; otherwise it is treated as a protein. Your sequence consists 100% of "ACGT" characters, so it should be impossible to interpret it as a protein.
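If you want to rule out a malformed query before calling qblast, a quick local check along these lines might help. It is only a sketch in plain Python: to_fasta and acgt_fraction are made-up helper names, and the fraction it prints is a rough sanity check, not a reproduction of the test BLAST runs on its side.

```python
def to_fasta(name, sequence):
    """Hypothetical helper: build a plain FASTA string (header line, then sequence)."""
    cleaned = "".join(sequence.split()).upper()  # drop stray whitespace and newlines
    return ">{}\n{}".format(name, cleaned)

def acgt_fraction(sequence):
    """Hypothetical helper: share of characters in the sequence that are A, C, G or T."""
    cleaned = "".join(sequence.split()).upper()
    if not cleaned:
        return 0.0
    return sum(base in "ACGT" for base in cleaned) / len(cleaned)

read = "AGACTGCGATCCGAACTGAGAAC"
query = to_fasta("TEST 1-211670", read)
print(query)
print(acgt_fraction(read))
```

The resulting query string is what would be passed as the third argument to qblast. If the fraction comes out well below 1.0, the string being submitted is probably not the read you expect.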