I have a file of accession numbers and 16S rRNA sequences, and what I'm trying to do is remove all the RNA lines and keep only the lines with the accession numbers and the species names (removing all the junk in between). My input file looks like this (there are > in front of the accession numbers):
> D50541 1 1409 1409bp rna Abiotrophia defectiva Aerococcaceae
CUGGCGGCGU GCCUAAUACA UGCAAGUCGA ACCGAAGCAU CUUCGGAUGC UUAGUGGCGA ACGGGUGAGU AACACGUAGA
UAACCUACCC UAGACUCGAG GAUAACUCCG GGAAACUGGA GCUAAUACUG GAUAGGAUAU AGAGAUAAUU UCUUUAUAUU
(... and many more lines)
> AY538167 1 1411 1411bp rna Acholeplasma hippikon Acholeplasmataceae
CUGGCGGCGU GCCUAAUACA UGCAAGUCGA ACGCUCUAUA GCAAUAUAGG GAGUGGCGAA CGGGUGAGUA ACACGUAGAU
AACCUACCCU UACUUCGAGG AUAACUUCGG GAAACUGGAG CUAAUACUGG AUAGGACAUA UUGAGGCAUC UUAAUAUGUU
...
I want my output to look like this:
>D50541 Abiotrophia defectiva Aerococcaceae
>AY538167 Acholeplasma hippikon Acholeplasmataceae
The code I wrote does what I want... for most of the lines. It looks like this:
#!/usr/bin/env python
# take LTPs111.compressed fasta and reduce to accession numbers with names.
import re
infilename = 'LTPs111.compressed.fasta'
outfilename = 'acs.fasta'
regex = re.compile(r'(>)\s(\w+).+[rna]\s+([A-Z].+)')
#remove extra letters and spaces
with open(infilename, 'r') as infile, open(outfilename, 'w') as outfile:
    for line in infile:
        x = regex.sub(r'\1\2 \3', line)
        #remove rna sequences
        for line in x:
            if '>' in line:
                outfile.write(x)
Sometimes, the code seems to skip over some of the names. For example, for the first accession number above, I only got back:
>D50541 Aerococcaceae
Why might my code be doing this? The input for each accession number looks identical, and the spacing between 'rna' and the first name is the same for each line (5 spaces).
Thank you to anyone who might have some ideas!
I still haven't been able to run your code to get the claimed results, but I think I know what the problem is:
>>> line = '> AY538167 1 1411 1411bp rna Acholeplasma hippikon Acholeplasmataceae'
>>> regex = re.compile(r'(>)\s(\w+).+[rna]\s+([A-Z].+)')
>>> regex.findall(line)
[('>', 'AY538167', 'Acholeplasmataceae')]
The problem is that [rna]\s+ matches any one of the characters r, n, or a at the end of a word. And, because all of the matches are greedy, with no lookahead or anything else to prevent it, this means that it matches the n at the end of hippikon.
The simple solution is to remove the brackets, so it matches the string rna:
>>> regex = re.compile(r'(>)\s(\w+).+rna\s+([A-Z].+)')
That won't work if any of your species or genera can end with that string. Are there any such names? If so, you need to come up with a better way to describe the cutoff between the 1409bp part and the rna part. The simplest may be to just look for rna surrounded by spaces:
>>> regex = re.compile(r'(>)\s(\w+).+\s+rna\s+([A-Z].+)')
Whether this is actually correct or not, I can't say without knowing more about the format, but hopefully you understand what I'm doing well enough to verify that it's correct (or at least to ask smarter questions than I can ask).
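As a quick sanity check (assuming the real file matches the two sample headers from the question), the space-anchored pattern produces exactly the desired output:
>>> regex = re.compile(r'(>)\s(\w+).+\s+rna\s+([A-Z].+)')
>>> regex.sub(r'\1\2 \3', '> D50541 1 1409 1409bp rna Abiotrophia defectiva Aerococcaceae')
'>D50541 Abiotrophia defectiva Aerococcaceae'
>>> regex.sub(r'\1\2 \3', '> AY538167 1 1411 1411bp rna Acholeplasma hippikon Acholeplasmataceae')
'>AY538167 Acholeplasma hippikon Acholeplasmataceae'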
It may help debug things to add capture groups. For example, instead of this:
(>)\s(\w+).+[rna]\s+([A-Z].+)
… search for this:
(>)(\s)(\w+)(.+[rna]\s+)([A-Z].+)
Obviously your desired capture groups are now \1\3 \5 instead of \1\2 \3… but the big thing is that you can see what got matched in \4:
[('>', ' ', 'AY538167', ' 1 1411 1411bp rna Acholeplasma hippikon ', 'Acholeplasmataceae')]
So, now the question is: why did .+[rna]\s+ match ' 1 1411 1411bp rna Acholeplasma hippikon '? Sometimes the context matters, but in this case, it doesn't. You don't want that group to match that string in any context, and yet it will always match it, so that's the part you have to debug.
Also, a visual regexp explorer often helps a lot. The best ones can color parts of the expression and the matched text, etc., to show you how and why the regexp is doing what it does.
Of course you're limited by those that run on your platform or online, and work with Python syntax. If you're careful and/or only use simple features (as in your example), perl/PCRE syntax is very close to Python, and JavaScript/ActionScript is also pretty close (the one big difference to keep in mind is that replace/sub uses $1 instead of \1).
I don't have a good online one to strongly recommend, but from a quick glance Debuggex looks pretty cool.
Items between brackets form a character class, so by writing "[rna]" you are asking for any single one of the characters r, n, or a, not the three-character string rna.
Further, if the lines you want all have the pattern "bp rna", I'd use that to yank those lines. By reading the file in line by line, the following worked for me for a quick and dirty line-yanker, for instance:
regex = re.compile(r'^[\w\s]+bp rna .*$')
But, again, if it's as simple as finding lines with "bp rna" in them, you could read the file line by line and forego regex entirely:
for line in file:
    if "bp rna" in line:
        print(line)
EDIT: I blew it by not reading the request carefully enough. Maybe a capture-and-replace regex would help?
for line in file:
    if "bp rna" in line:
        subreg = re.sub(r'^(>[\w]+)\s[\d\s]+bp\srna\s([\w\s]+$)', r"\1 \2", line)
        print(subreg)
OUTPUT:
>AY538166 Acholeplasma granularum Acholeplasmataceae
>AY538167 Acholeplasma hippikon Acholeplasmataceae
This should match any whitespace (tabs or spaces) between the things you want.
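Alternatively, since the header fields are purely positional, you could skip regex entirely for the cleanup and just split on whitespace. A sketch, assuming every header line has exactly the seven-field layout shown in the question (filenames taken from the OP's script):
with open('LTPs111.compressed.fasta') as infile, open('acs.fasta', 'w') as outfile:
    for line in infile:
        if 'bp rna' in line:
            parts = line.split()
            # parts == ['>', 'D50541', '1', '1409', '1409bp', 'rna', 'Abiotrophia', 'defectiva', 'Aerococcaceae']
            outfile.write('>%s %s\n' % (parts[1], ' '.join(parts[6:])))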
Related
So I have several examples of raw text in which I have to extract the characters after 'Terms'. The common pattern I see is that after the word 'Terms' there is a '\n', and there is another '\n' at the end. I want to extract all the characters (words, numbers, symbols) present between these two '\n', but only after the keyword 'Terms'.
Some examples of text are given below:
1) \nTERMS \nDirect deposit; Routing #256078514, acct. #160935\n\n'
2) \nTerms\nDue on receipt\nDue Date\n1/31/2021
3) \nTERMS: \nNET 30 DAYS\n
The code I have written is given below:
import re

def get_term_regex(s):
    raw_text = s
    term_regex1 = r'(TERMS\s*\\n(.*?)\\n)'
    try:
        if ('TERMS' or 'Terms') in raw_text:
            pattern1 = re.search(term_regex1, raw_text)
            #print(pattern1)
            return pattern1
    except:
        pass
But I am not getting any output, as there is no match.
The expected output is:
1) Direct deposit; Routing #256078514, acct. #160935
2) Due on receipt
3) NET 30 DAYS
Any help would be really appreciated.
Try the following:
import re
text = '''1) \nTERMS \nDirect deposit; Routing #256078514, acct. #160935\n\n'
2) \nTerms\nDue on receipt\nDue Date\n1/31/2021
3) \nTERMS: \nNET 30 DAYS\n''' # \n are real new lines
for m in re.finditer(r'(TERMS|Terms)\W*\n(.*?)\n', text):
    print(m.group(2))
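With the sample text above, this prints the three expected fragments:
Direct deposit; Routing #256078514, acct. #160935
Due on receipt
NET 30 DAYS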
Note that your regex could not deal with the third 'line' because there is a colon : after TERMS, so I replaced \s with \W. Note also that \\n in your pattern matches a literal backslash followed by n, not a newline; with real newlines in the text (as above), that alone prevents any match.
('TERMS' or 'Terms') in raw_text might not be what you want. It does not raise a syntax error, but it is just the same as 'TERMS' in raw_text: when Python evaluates the parenthesized part, or returns its first truthy operand, and since 'TERMS' is truthy, the expression evaluates to 'TERMS'. The result is that 'Terms' can never be picked up by that test!
So you might instead want something like ('TERMS' in raw_text) or ('Terms' in raw_text), although it is quite verbose.
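Or you could fold both fixes into a single case-insensitive search; a sketch (note re.IGNORECASE will also accept other casings like 'terms'):
import re

def get_term_regex(s):
    # one case-insensitive search; no separate membership test needed
    m = re.search(r'TERMS\W*\n(.*?)\n', s, re.IGNORECASE)
    return m.group(1) if m else None

print(get_term_regex('\nTERMS: \nNET 30 DAYS\n'))  # NET 30 DAYS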
I want to replace numbers from a text file and write the result to a new text file. I tried to solve it with a dictionary, but now Python also replaces substrings.
For example: I want to replace the number 01489 with 1489, but with this code it also turns 014896 into 14896. How can I get rid of this? Thank you!!!
replacements = {'01489':'1489', '01450':'1450'}
infile = open('test_fplan.txt', 'r')
outfile = open('test_fplan_neu.txt', 'w')
for line in infile:
    for src, target in replacements.iteritems():
        line = line.replace(src, target)
    outfile.write(line)
I don't know how your input file looks, but if the numbers are surrounded by spaces, this should work:
replacements = {' 01489 ':' 1489 ', ' 01450 ':' 1450 '}
It looks like your concern is that it's also modifying numbers that contain your src pattern as a substring. To avoid that, you'll need to first define the boundaries that should be respected. For instance, do you want to insist that only matched numbers surrounded by spaces get replaced? Or perhaps just that there be no adjacent digits (or periods or commas). Since you'll probably want to use regular expressions to constrain the matching, as suggested by JoshuaF in another answer, you'll likely need to avoid the simple replace function in favor of something from the re library.
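For instance, here is one sketch of that idea using lookarounds that forbid adjacent digits, periods, or commas (the sample line is invented):
import re

line = 'stop 01489 here, but not 014896 or 0.014890'
print(re.sub(r'(?<![\d.,])01489(?![\d.,])', '1489', line))
# stop 1489 here, but not 014896 or 0.014890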
Use regexp with negative lookarounds:
import re

replacements = {'01489': '1489', '01450': '1450'}

def find_replacement(match_obj):
    number = match_obj.group(0)
    return replacements.get(number, number)

with open('test_fplan.txt') as infile:
    with open('test_fplan_neu.txt', 'w') as outfile:
        outfile.writelines(
            re.sub(r'(?<!\d)(\d+)(?!\d)', find_replacement, line)
            for line in infile
        )
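A quick check of the behaviour on a made-up line:
>>> re.sub(r'(?<!\d)(\d+)(?!\d)', find_replacement, 'stops 01489 and 01450, but not 014896')
'stops 1489 and 1450, but not 014896'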
Check out the regular expression syntax https://docs.python.org/2/library/re.html. It should allow you to match whatever pattern you're looking for exactly.
I know that there are a bunch of other similar questions to this, but I have built off other answers with no success.
I've dug here, here, here, here, and here
but this question is closest to what I'm trying to do; however, it's in PHP and I'm using Python 3.
My goal is to extract a substring from a body text.
The body is formatted:
**Header1**
thing1
thing2
thing3
thing4
**Header2**
dsfgs
sdgsg
rrrrrr
**Hello Dolly**
abider
abcder
ffffff
etc.
Formatting on SO is tough. But in the actual text, there are no spaces, just newlines for each line.
I want what's under Header2, so currently I have:
found = re.search("\*\*Header2\*\*\n[^*]+", body)
if found:
    list = found.group(0)
    list = list[11:]
    list = list.split('\n')
    print(list)
But that's returning "None". Various other regex I've tried also haven't worked, or grabbed too much (all of the remaining headers).
For what it's worth I've also tried:
\*\*Header2\*\*.+?^\**$
\*\*Header2\*\*[^*\s\S]+\*\* and about 10 other permutations of those.
Brief
Your pattern \*\*Header2\*\*\n[^*]+ isn't matching because your line **Header2** includes trailing spaces before the newline character. Adding ' *' (optional trailing spaces) before the \n should suffice, but I've added other options below as well.
Code
See regex in use here
\*{2}Header2\*{2} *\n([^*]+)
Alternatively, you can also use the following regex (which also allows you to capture lines with * in them so long as they don't match the format of your header ^\*{2}[^*]*\*{2} - it also beautifully removes whitespace from the last element under the header - uses the im flags):
See regex in use here
^\*{2}Header2\*{2} *\n((?:(?!^\*{2}[^*]*\*{2}).)*?)(?=\s*^\*{2}[^*]*\*{2}|\s*\Z)
Usage
See code in use here
import re
regex = r"\*{2}Header2\*{2}\s*([^*]+)\s*"
test_str = ("**Header1** \n"
            "thing1 \n"
            "thing2 \n"
            "thing3 \n"
            "thing4 \n\n"
            "**Header2** \n"
            "dsfgs \n"
            "sdgsg \n"
            "rrrrrr \n\n"
            "**Hello Dolly** \n"
            "abider \n"
            "abcder \n"
            "ffffff")
print(re.search(regex, test_str).group(1))
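For reference, this prints the captured group (the trailing spaces and final blank line survive, since the greedy [^*]+ swallows them):
dsfgs
sdgsg
rrrrrr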
Explanation
The pattern is practically identical to the OP's original pattern. I made minor changes to allow it to better perform and also get the result the OP is expecting.
\*\* changed to \*{2}: Very minor adjustment for performance
\n changed to *\n: Takes additional spaces at the end of a line into account before the newline character
([^*]+): Captures the contents the OP is expecting into capture group 1
You could use
^\*\*Header2\*\*.*[\n\r]
(?P<content>(?:.+[\n\r])+)
with the multiline and verbose modifier, see a demo on regex101.com.
Afterwards, just grab what is inside content (i.e. using re.finditer()).
Broken down this says:
^\*\*Header2\*\*.*[\n\r] # match **Header2** at the start of the line
# and newline characters
(?P<content>(?:.+[\n\r])+) # afterwards match as many non-null lines as possible
In Python:
import re
rx = re.compile(r'''
    ^\*\*Header2\*\*.*[\n\r]
    (?P<content>(?:.+[\n\r])+)
    ''', re.MULTILINE | re.VERBOSE)

for match in rx.finditer(your_string_here):
    print(match.group('content'))
I have the feeling that you even want to allow empty lines between paragraphs. If so, change the expression to
^\*\*Header2\*\*.*[\n\r]
(?P<content>[\s\S]+?)
(?=^\*\*)
See a demo for the latter on regex101.com as well.
You can try this:
import re
s = """
**Header1**
thing1
thing2
thing3
thing4
**Header2**
dsfgs
sdgsg
rrrrrr
**Hello Dolly**
abider
abcder
ffffff
"""
new_contents = re.findall('(?<=\*\*Header2\*\*)[\n\sa-zA-Z0-9]+', s)
Output:
[' \ndsfgs \nsdgsg \nrrrrrr \n\n']
If you want to remove special characters from the output, you can try this:
final_data = list(filter(None, re.split('\s+', re.sub('\n+', '', new_contents[0]))))
Output:
['dsfgs', 'sdgsg', 'rrrrrr']
My code does the following:
Take a large text file (i.e. a legal document that is 300 pages as a PDF).
Find a certain keyword (e.g. "small").
Return n words to the left and n words to the right of the keyword.
NOTE: In this context, a "word" is any string of non-space characters. "$cow123" would be a word, but "health care" would be two words.
Here is my problem:
The code takes an extremely long time to run on the 300 pages, and that time tends to increase very quickly as n increases.
Here is my code:
import re

fileHandle = open('test_pdf.txt', mode='r')
document = fileHandle.read()

def search(searchText, doc, n):
    #Searches for text, and retrieves n words either side of the text, which are returned separately
    surround = r"\s*(\S*)\s*"
    groups = re.search(r'{}{}{}'.format(surround*n, searchText, surround*n), doc).groups()
    return groups[:n], groups[n:]
Here is the nasty culprit:
print search("\$27.5 million", document, 10)
Here's how you can test this code:
Copy the function definition from the code block above and run the following:
t = "The world is a small place, we $.205% try to take care of it."
print search("\$.205", t, 3)
I suspect that I have a nasty case of catastrophic backtracking, but I'm too new to regex to point my finger on the problem.
How do I speed up my code?
How about using re.search (or even string.find if you're only searching for fixed strings) to find the string, without any surrounding capturing groups? Then use the position and length of the match (.start and .end on an re match object, or the return value of find plus the length of the search string): get the substring before the match and do /\s*(\S*)\s*\z/ etc. on it, and get the substring after the match and do /\A\s*(\S*)\s*/ etc. on it.
Also, for help with your backtracking: you can use a pattern like \s+\S+\s+ instead of \s*\S*\s* (two chunks of whitespace have to be separated by a non-zero amount of non-whitespace, or else they wouldn't be two chunks), and you shouldn't butt up two consecutive \s*s like you do. I think r'\s+'.join([r'(\S+)'] * n) would give the right pattern for capturing n consecutive words (but my Python is rusty, so check that).
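A sketch of that approach, keeping the OP's function signature and using plain str.split() for the word-splitting (the test sentence is the OP's):
import re

def search(searchText, doc, n):
    # find the keyword once, with no capturing groups around it ...
    m = re.search(searchText, doc)
    if not m:
        return None
    # ... then slice and split the two sides; no backtracking involved
    before = doc[:m.start()].split()[-n:]
    after = doc[m.end():].split()[:n]
    return before, after

t = "The world is a small place, we $.205% try to take care of it."
print(search(r"\$.205", t, 3))
# (['small', 'place,', 'we'], ['%', 'try', 'to'])
# note '%' becomes its own 'word' because the keyword splits that token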
I see several problems here. The first, and probably worst, is that everything in your "surround" regex is not just optional but independently optional. Given this string:
"Lorem ipsum tritani impedit civibus ei pri"
...when searchText = "tritani" and n = 1, this is what it has to go through before it finds the first match:
regex: \s* \S* \s* tritani
offset 0: '' 'Lorem' ' ' FAIL
'' 'Lorem' '' FAIL
'' 'Lore' '' FAIL
'' 'Lor' '' FAIL
'' 'Lo' '' FAIL
'' 'L' '' FAIL
'' '' '' FAIL
...then it bumps ahead one position and starts over:
offset 1: '' 'orem' ' ' FAIL
'' 'orem' '' FAIL
'' 'ore' '' FAIL
'' 'or' '' FAIL
'' 'o' '' FAIL
'' '' '' FAIL
... and so on. According to RegexBuddy's debugger, it takes almost 150 steps to reach the offset where it can make the first match:
position 5: ' ' 'ipsum' ' ' 'tritani'
And that's with just one word to skip over, and with n=1. If you set n=2 you end up with this:
\s*(\S*)\s*\s*(\S*)\s*tritani\s*(\S*)\s*\s*(\S*)\s*
I'm sure you can see where this is going. Note especially that when I change it to this:
(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)tritani(?:\s+)(\S+)(?:\s+)(\S+)(?:\s+)
...it finds the first match in a little over 20 steps. This is one of the most common regex anti-patterns: using * when you should be using +. In other words, if it's not optional, don't treat it as optional.
Finally, you may have noticed the \s*\s* in the auto-generated regex: as mentioned in another answer, butting two \s* up against each other is redundant, and it gives the engine even more equivalent ways to split the same whitespace, which makes the backtracking worse.
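To make that concrete, here is a sketch that builds the all-required version of the pattern for any n (build_surround_pattern is my name, not from the original code; like any fixed-window pattern, it needs exactly n words on each side of the keyword to match):
import re

def build_surround_pattern(search_text, n):
    # words and the gaps between them are required (+), never optional (*)
    side = r'\s+'.join([r'(\S+)'] * n)
    return re.compile(side + r'\s+' + search_text + r'\s+' + side)

p = build_surround_pattern('tritani', 1)
print(p.search('Lorem ipsum tritani impedit civibus ei pri').groups())
# ('ipsum', 'impedit')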
You could try using mmap and appropriate regex flags, eg (untested):
import re
import mmap
with open('your file') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    for match in re.finditer(your_re, mf, flags=re.DOTALL):
        print match.group() # do something with your match
This'll only keep memory usage lower though...
The alternative is to have a sliding window of words (simple example of just single word before and after)...:
import re
import mmap
from itertools import islice, tee, izip_longest
with open('testingdata.txt') as fin:
    mf = mmap.mmap(fin.fileno(), 0, access=mmap.ACCESS_READ)
    words = (m.group() for m in re.finditer('\w+', mf, flags=re.DOTALL))
    grouped = [islice(el, idx, None) for idx, el in enumerate(tee(words, 3))]
    for group in izip_longest(*grouped, fillvalue=''):
        if group[1] == 'something': # check criteria for group
            print group
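The same windowing idea generalizes to n words on each side with collections.deque (my sketch, not part of the answer above; like the fixed regex, it only reports keywords that have a full n words on both sides):
from collections import deque

def context(words, keyword, n):
    window = deque(maxlen=2 * n + 1)
    for w in words:
        window.append(w)
        # fire whenever the keyword sits exactly in the middle of a full window
        if len(window) == window.maxlen and window[n] == keyword:
            items = list(window)
            yield items[:n], items[n + 1:]

words = 'Lorem ipsum tritani impedit civibus ei pri'.split()
for before, after in context(words, 'tritani', 2):
    print(before, after)
# ['Lorem', 'ipsum'] ['impedit', 'civibus']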
I think you are going about this completely backwards (I'm a little confused as to what you are doing in the first place!)
I would recommend checking out the re_search function I developed in the textools module of my cloud toolbox.
with re_search you could solve this problem with something like:
from cloudtb import textools
data_list = textools.re_search('my match', pdf_text_str) # search for character objects
# you now have a list of strings and RegPart objects. Parse through them:
for i, regpart in enumerate(data_list):
    if isinstance(regpart, basestring):
        words = textools.re_search('\w+', regpart)
        # do stuff with words
    else:
        # I think you are ignoring these? Not totally sure
        pass
Here is a link on how to use and how it works:
http://cloudformdesign.com/?p=183
In addition to this, your regular expressions would also be printed out in more readable format.
You might also want to check out my tool Search The Sky or the similar tool Kiki to help you build and understand your regular expressions.
I want to replace all occurrences of integers which are greater than 2147483647 and are followed by ^^<int> with the first 3 digits of the number. For example, I have my original data as:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
I want to replace the original data by the below mentioned data:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "255"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
The way I have implemented it is by scanning the data line by line. If I find numbers greater than 2147483647, I replace them by the first 3 digits. However, I don't know how I should check that the next part of the string is ^^<int> .
What I want to do is: for numbers greater than 2147483647 e.g. 25500000000, I want to replace them with the first 3 digits of the number. Since my data is 1 Terabyte in size, a faster solution is much appreciated.
Use the re module to construct a regular expression:
regex = r"""
( # Capture in group #1
"[\w\s]+" # Three sequences of quoted letters and white space characters
\s+ # followed by one or more white space characters
"[\w\s]+"
\s+
"[\w\s]+"
\s+
)
"(\d{10,})" # Match a quoted set of at least 10 integers into group #2
(^^\s+\.\s+) # Match by two circumflex characters, whitespace and a period
# into group #3
(.*) # Followed by anything at all into group #4
"""
COMPILED_REGEX = re.compile(regex, re.VERBOSE)
Next, we need to define a callback function (since re.RegexObject.sub takes a callback) to handle the replacement:
def replace_callback(matches):
    full_line = matches.group(0)
    number_text = matches.group(2)
    number_of_interest = int(number_text, base=10)
    if number_of_interest > 2147483647:
        return full_line.replace(number_text, number_text[:3])
    else:
        return full_line
And then find and replace:
fixed_data = COMPILED_REGEX.sub(replace_callback, YOUR_DATA)
If you have a terabyte of data you will probably not want to do this in memory - you'll want to open the file and then iterate over it, replacing the data line by line and writing it back out to another file (there are undoubtedly ways to speed this up, but they will make the gist of the technique harder to follow):
# Given the above
def process_data():
    with open("path/to/your/file") as data_file, \
            open("path/to/output/file", "w") as output_file:
        for line in data_file:
            fixed_data = COMPILED_REGEX.sub(replace_callback, line)
            output_file.write(fixed_data)
If each line in your text file looks like your example, then you can do this:
In [2078]: line = '"QuestionAndAnsweringWebsite" "fact". "Ask a Question" "25500000000"^^ . "language" "89028899"'
In [2079]: re.findall('\d+"\^\^', line)
Out[2079]: ['25500000000"^^']
with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        for found in re.findall('\d+"\^\^', line):
            if int(found[:-3]) > 2147483647:
                line = line.replace(found, found[:3] + '"^^')  # keep the closing quote and circumflexes
        outfile.write(line)
Because of the inner for-loop, this has the potential to be an inefficient solution. However, I can't think of a better regex at the moment, so this should get you started, at the very least
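For what it's worth, one way to avoid the inner loop is to let re.sub do the scan with a replacement function. A sketch, assuming oversized numbers are always quoted and immediately followed by ^^<int>, as in the sample:
import re

def shorten(m):
    digits = m.group(1)
    # keep only the first 3 digits when the value exceeds the 32-bit max
    if int(digits) > 2147483647:
        digits = digits[:3]
    return '"%s"^^<int>' % digits

line = '"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .'
print(re.sub(r'"(\d+)"\^\^<int>', shorten, line))
# "Ask a Question" <at> "255"^^<int> <stack_overflow> .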