How to substitute specific patterns in Python

I want to replace all occurrences of integers that are greater than 2147483647 and are followed by ^^<int> with the first 3 digits of the number. For example, I have my original data as:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
I want to replace the original data by the below mentioned data:
<stack_overflow> <isA> "QuestionAndAnsweringWebsite" "fact".
"Ask a Question" <at> "255"^^<int> <stack_overflow> .
<basic> "language" "89028899" <html>.
The way I have implemented it is by scanning the data line by line. If I find a number greater than 2147483647, I replace it with its first 3 digits. However, I don't know how I should check that the next part of the string is ^^<int>.
What I want to do is: for numbers greater than 2147483647, e.g. 25500000000, replace them with the first 3 digits of the number. Since my data is 1 terabyte in size, a faster solution is much appreciated.

Use the re module to construct a regular expression:
regex = r"""
( # Capture in group #1
"[\w\s]+" # Three sequences of quoted letters and white space characters
\s+ # followed by one or more white space characters
"[\w\s]+"
\s+
"[\w\s]+"
\s+
)
"(\d{10,})" # Match a quoted set of at least 10 integers into group #2
(^^\s+\.\s+) # Match by two circumflex characters, whitespace and a period
# into group #3
(.*) # Followed by anything at all into group #4
"""
COMPILED_REGEX = re.compile(regex, re.VERBOSE)
Next, we need to define a callback function (since re.RegexObject.sub takes a callback) to handle the replacement:
def replace_callback(matches):
    full_line = matches.group(0)
    number_text = matches.group(2)
    number_of_interest = int(number_text, base=10)
    if number_of_interest > 2147483647:
        # str.replace wants strings: swap the matched digits, not the int
        return full_line.replace(number_text, number_text[:3])
    else:
        return full_line
And then find and replace:
fixed_data = COMPILED_REGEX.sub(replace_callback, YOUR_DATA)
If you have a terabyte of data you will probably not want to do this in memory - you'll want to open the file and then iterate over it, replacing the data line by line and writing it back out to another file (there are undoubtedly ways to speed this up, but they would make the gist of the technique harder to follow):
# Given the above
def process_data():
    with open("path/to/your/file") as data_file, \
         open("path/to/output/file", "w") as output_file:
        for line in data_file:
            fixed_data = COMPILED_REGEX.sub(replace_callback, line)
            output_file.write(fixed_data)

If each line in your text file looks like your example, then you can do this:
In [2078]: line = '"QuestionAndAnsweringWebsite" "fact". "Ask a Question" "25500000000"^^ . "language" "89028899"'
In [2079]: re.findall('\d+"\^\^', line)
Out[2079]: ['25500000000"^^']
import re

with open('path/to/input') as infile, open('path/to/output', 'w') as outfile:
    for line in infile:
        for found in re.findall(r'\d+"\^\^', line):
            if int(found[:-3]) > 2147483647:
                # found[:3] keeps the first three digits; found[-3:] restores the '"^^'
                line = line.replace(found, found[:3] + found[-3:])
        outfile.write(line)
Because of the inner for-loop, this has the potential to be an inefficient solution. However, I can't think of a better regex at the moment, so this should get you started, at the very least.
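If every oversized number in the data really is a quoted digit run immediately followed by ^^<int>, as in the sample, a lookahead can key the match to that suffix directly and avoid the inner loop. A minimal sketch under that assumption:
import re

# Match a quoted run of 10+ digits only when ^^<int> immediately follows.
PATTERN = re.compile(r'"(\d{10,})"(?=\^\^<int>)')

def truncate(match):
    digits = match.group(1)
    if int(digits) > 2147483647:
        return '"{}"'.format(digits[:3])
    return match.group(0)  # small enough: leave it untouched

line = '"Ask a Question" <at> "25500000000"^^<int> <stack_overflow> .'
print(PATTERN.sub(truncate, line))
# "Ask a Question" <at> "255"^^<int> <stack_overflow> .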

How to parse lines that start with multiple possible keywords in Python?

I am trying to parse an input txt file, checking whether each line starts with one of several possible keywords and handling it accordingly. The input is SQL-like text, and when a keyword is found I need to check the pattern of what follows it. I got some help earlier and was able to parse some of the file, but when I tried to add other possible keywords, some lines were not parsed and I got a value error instead. I checked the re documentation and understood how to check whether a line starts with one of multiple strings, but I still end up with a value error. Can anyone take a look and correct me?
Use case:
So for each line, I expect to see one of these at the start: CREATE, CREATE OR REPLACE, END, INPUT, INPUT FROM, INSERT INTO. I attempted this using #flakes' help from a previous post:
import re, os, sys

matcher = re.compile(r"^[(CREATE(?: OR REPLACE)?) | END | (INPUT (?: FROM)) | (INSERT(?:INTO))] (\S+) (\S+).*$")
with open('input.txt', 'r+') as f:
    lines = f.readlines()
    nlines = [v for v in lines if not v.isspace()]
    for line in nlines:
        if match := matcher.match(line):
            component, component_type, component_name = match.groups()
            print(f"{component=},{component_type=}, {component_name=}")
But nothing is printed out. I checked this pattern at https://regex101.com, but end up with this value error:
ValueError: not enough values to unpack (expected 3, got 2)
I used this input file for parsing using re.
Objective
Basically, I need to check whether each line starts with CREATE, CREATE OR REPLACE, END, INPUT, INPUT FROM, or INSERT INTO; if it does, print the next word and check the pattern of the word after that. Can anyone point out what went wrong with the pattern I composed?
If you always want to get 3 groups back, you can make the last group capture optional non-whitespace characters, as that value is not always present in the example data.
You should use a single capture group for all the keyword variants at the start of the string, and put the optional parts in non-capture groups.
Note that you should not put the intended groupings in square brackets, as that changes the meaning to a character class matching any one of the characters listed in it.
The updated pattern:
^(CREATE(?: OR REPLACE)?|END|INPUT FROM|INSERT INTO) (\S+) ?(\S*)
You can see the capture groups in a demo at https://regex101.com.
import re

matcher = re.compile(r"^(CREATE(?: OR REPLACE)?|END|INPUT FROM|INSERT INTO) (\S+) ?(\S*)")
with open('input.txt', 'r+') as f:
    lines = f.readlines()
    nlines = [v for v in lines if not v.isspace()]
    for line in nlines:
        if match := matcher.match(line):
            component, component_type, component_name = match.groups()
            print(f"{component=},{component_type=}, {component_name=}")
Output
component='CREATE',component_type='application', component_name='vat4_sac_rec_cln1_tst_noi'
component='CREATE',component_type='FLOW', component_name='flow_src_vat4_sac_rec_cln1_tst_noi;'
component='CREATE OR REPLACE',component_type='SOURCE', component_name='comp_src_vat4_sac_rec_cln1_tst_noi'
component='END',component_type='FLOW', component_name='flow_src_vat4_sac_rec_cln1_tst_noi;'
component='CREATE',component_type='FLOW', component_name='flow_tgt_vat4_sac_rec_cln1_tst_noi;'
component='CREATE',component_type='STREAM', component_name='vat4_sac_rec_cln1_tst_noi_text_clean_stream'
component='CREATE',component_type='STREAM', component_name='CQ_STREAM_vat4_sac_rec_cln1_tst_noi'
component='CREATE OR REPLACE',component_type='TARGET', component_name='comp_tgt_vat4_sac_rec_cln1_tst_noi'
component='INPUT FROM',component_type='vat4_sac_rec_cln1_tst_noi_text_clean_stream;', component_name=''
component='CREATE OR REPLACE',component_type='CQ', component_name='CQ_vat4_sac_rec_cln1_tst_noi'
component='INSERT INTO',component_type='CQ_STREAM_vat4_sac_rec_cln1_tst_noi', component_name=''
component='CREATE OR REPLACE',component_type='OPEN', component_name='PROCESSOR'
component='INSERT INTO',component_type='vat4_sac_rec_cln1_tst_noi_text_clean_stream', component_name=''
component='END',component_type='FLOW', component_name='flow_tgt_vat4_sac_rec_cln1_tst_noi;'
component='END',component_type='application', component_name='vat4_sac_rec_cln1_tst_noi;'
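Note that the question also lists a bare INPUT among the expected keywords, while the pattern above only covers INPUT FROM. If bare INPUT lines can occur, a small variation (sketched here against made-up lines, not the original input file) makes FROM optional:
import re

matcher = re.compile(r"^(CREATE(?: OR REPLACE)?|END|INPUT(?: FROM)?|INSERT INTO) (\S+) ?(\S*)")

print(matcher.match("INPUT FROM some_stream;").groups())
# ('INPUT FROM', 'some_stream;', '')
print(matcher.match("INPUT some_stream;").groups())
# ('INPUT', 'some_stream;', '')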

Positive lookbehind, slice words by \t until \n

I am new to regex. I am attempting to use regex with python to find a line in a file and extract all of the subsequent words separated by tab stops. My line looks like this.
#position 4450 4452 4455 4465 4476 4496 D110 D111 D112 D114 D116 D118 D23 D24 D27 D29 D30 D56 D59 D69 D85 D88 D90 D91 JW1 JW10 JW15 JW22 JW28 JW3 JW35 JW39 JW43 JW45 JW47 JW49 JW5 JW52 JW54 JW56 JW57 JW59 JW66 JW7 JW70 JW75 JW77 JW9 REF_OR74A
I have identified that the base of this expression involves the positive lookbehind.
(?<=#position).*
I do not expect this to separate the matches by tabstop. However, it does find my line in the file:
import re
file = open('src.txt','r')
f = list(file)
file.close()
pattern = '(?<=#position).*'
regex = re.compile(pattern)
regex.findall(''.join(f))
['\t4450\t4452\t4455\t4465\t4476\t4496\tD110\tD111\tD112\tD114\tD116\tD118\tD23\tD24\tD27\tD29\tD30\tD56\tD59\tD69\tD85\tD88\tD90\tD91\tJW1\tJW10\tJW15\tJW22\tJW28\tJW3\tJW35\tJW39\tJW43\tJW45\tJW47\tJW49\tJW5\tJW52\tJW54\tJW56\tJW57\tJW59\tJW66\tJW7\tJW70\tJW75\tJW77\tJW9\tREF_OR74A']
With some kludge and list slicing / string methods, I can manipulate this and get my data out. What I'd really like to do is have findall yield a list of just these entries. What would the regular expression look like to do that?
Do you need to use regex? List slicing and string methods don't appear to be as much of a kludge as you say.
something like:
f = open('src.txt','r')
for line in f:
    if line.startswith("#position"):
        l = line.split()  # with no arguments it splits on all whitespace characters
        l = l[1:]         # get rid of the "#position" tag
        break
and further manipulate from there?
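If you do want findall itself to yield just the entries, one sketch (against a shortened version of the sample line) is to check for the #position tag first and then pull every run of characters between tabs:
import re

line = "#position\t4450\t4452\t4455\tD110\tJW1\tREF_OR74A\n"

if line.startswith("#position"):
    # each entry is a run of non-tab characters preceded by a tab
    entries = re.findall(r"\t([^\t\n]+)", line)
    print(entries)  # ['4450', '4452', '4455', 'D110', 'JW1', 'REF_OR74A']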

Python - trying to capture the middle of a line, regex or split

I have a text file with some names and emails and other stuff. I want to capture email addresses.
I don't know whether this is a split or regex problem.
Here are some sample lines:
[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81
[name]mark hilly [email]mark.hilly#hotmail.com [dob]02.11.80
[name]gill silly [email]gill.silly#hotmail.com [dob]03.12.79
I want to be able to do a loop that prints all the email addresses.
Thanks.
I'd use a regex:
import re

data = '''[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81
[name]mark hilly [email]mark.hilly#hotmail.com [dob]02.11.80
[name]gill silly [email]gill.silly#hotmail.com [dob]03.12.79'''

group_matcher = re.compile(r'\[(.*?)\]([^\[]+)')
for line in data.split('\n'):
    o = dict(group_matcher.findall(line))
    print o['email']
\[ is literally [.
(.*?) is a non-greedy capturing group. It "expands" to capture the text.
\] is literally ]
( is the beginning of a capturing group.
[^\[] matches anything but a [.
+ repeats the last pattern any number of times.
) closes the capturing group.
for line in lines:
    print line.split("]")[2].split(" ")[0]
You can pass substrings to split, not just single characters, so:
email = line.partition('[email]')[-1].partition('[')[0].rstrip()
This has an advantage over the simple split solutions that it will work on fields that can have spaces in the value, on lines that have things in a different order (even if they have [email] as the last field), etc.
To generalize it:
def get_field(line, field):
    return line.partition('[{}]'.format(field))[-1].partition('[')[0].rstrip()
However, I think it's still more complicated than the regex solution. Plus, it can only search for exactly one field at a time, instead of all fields at once (without making it even more complicated). To get two fields, you'll end up parsing each line twice, like this:
for line in data.splitlines():
    print '''{} "babysat" Dan O'Brien on {}'''.format(get_field(line, 'name'),
                                                      get_field(line, 'dob'))
(I may have misinterpreted the DOB field, of course.)
You can split by space and then search for the element that starts with [email]:
line = '[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81'
items = line.split()
for item in items:
    if item.startswith('[email]'):
        print item.replace('[email]', '', 1)
Say you have a file with lines:
import re

f = open("logfile", "r")
data = f.read()
for line in data.split("\n"):
    match = re.search(r"email\](?P<id>.*)\[dob", line)
    if match:
        # either store or print the emails as you like
        print match.group('id').strip(), "\n"
That's all! (Try it; for Python 3 and above, remember print is a function, so make those changes.)
The output from your sample data:
bill.billy#hotmail.com
mark.hilly#hotmail.com
gill.silly#hotmail.com
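As a compact variant of the same idea, a single findall can pull every address out of the whole blob at once; a quick sketch on the sample data:
import re

data = '''[name]bill billy [email]bill.billy#hotmail.com [dob]01.01.81
[name]mark hilly [email]mark.hilly#hotmail.com [dob]02.11.80
[name]gill silly [email]gill.silly#hotmail.com [dob]03.12.79'''

# grab the run of non-space characters right after each [email] tag
print(re.findall(r'\[email\](\S+)', data))
# ['bill.billy#hotmail.com', 'mark.hilly#hotmail.com', 'gill.silly#hotmail.com']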

Editing lines and removing lines from file

I have a file of accession numbers and 16S rrna sequences, and what I'm trying to do is remove all lines of RNA, and only keep the lines with the accession numbers and the species name (and remove all the junk in between). So my input file looks like this (there are > in front of the accession numbers):
> D50541 1 1409 1409bp rna Abiotrophia defectiva Aerococcaceae
CUGGCGGCGU GCCUAAUACA UGCAAGUCGA ACCGAAGCAU CUUCGGAUGC UUAGUGGCGA ACGGGUGAGU AACACGUAGA
UAACCUACCC UAGACUCGAG GAUAACUCCG GGAAACUGGA GCUAAUACUG GAUAGGAUAU AGAGAUAAUU UCUUUAUAUU
(... and many more lines)
> AY538167 1 1411 1411bp rna Acholeplasma hippikon Acholeplasmataceae
CUGGCGGCGU GCCUAAUACA UGCAAGUCGA ACGCUCUAUA GCAAUAUAGG GAGUGGCGAA CGGGUGAGUA ACACGUAGAU
AACCUACCCU UACUUCGAGG AUAACUUCGG GAAACUGGAG CUAAUACUGG AUAGGACAUA UUGAGGCAUC UUAAUAUGUU
...
I want my output to look like this:
>D50541 Abiotrophia defectiva Aerococcaceae
>AY538167 Acholeplasma hippikon Acholeplasmataceae
The code I wrote does what I want... for most of the lines. It looks like this:
#!/usr/bin/env python
# take LTPs111.compressed fasta and reduce to accession numbers with names.
import re

infilename = 'LTPs111.compressed.fasta'
outfilename = 'acs.fasta'

regex = re.compile(r'(>)\s(\w+).+[rna]\s+([A-Z].+)')
# remove extra letters and spaces
with open(infilename, 'r') as infile, open(outfilename, 'w') as outfile:
    for line in infile:
        x = regex.sub(r'\1\2 \3', line)
        # remove rna sequences
        for line in x:
            if '>' in line:
                outfile.write(x)
Sometimes the code seems to skip over some of the names. For example, for the first accession number above, I only got back:
>D50541 Aerococcaceae
Why might my code be doing this? The input for each accession number looks identical, and the spacing between 'rna' and the first name is the same for each line (5 spaces).
Thank you to anyone who might have some ideas!
I still haven't been able to run your code to get the claimed results, but I think I know what the problem is:
>>> line = '> AY538167 1 1411 1411bp rna Acholeplasma hippikon Acholeplasmataceae'
>>> regex = re.compile(r'(>)\s(\w+).+[rna]\s+([A-Z].+)')
>>> regex.findall(line)
[('>', 'AY538167', 'Acholeplasmataceae')]
The problem is that [rna]\s+ matches any one of the characters r, n, or a at the end of a word. And, because all of the matches are greedy, with no lookahead or anything else to prevent it, this means that it matches the n at the end of hippikon.
The simple solution is to remove the brackets, so it matches the string rna:
>>> regex = re.compile(r'(>)\s(\w+).+rna\s+([A-Z].+)')
That won't work if any of your species or genera can end with that string. Are there any such names? If so, you need to come up with a better way to describe the cutoff between the 1409bp part and the rna part. The simplest may be to just look for rna surrounded by spaces:
>>> regex = re.compile(r'(>)\s(\w+).+\s+rna\s+([A-Z].+)')
Whether this is actually correct or not, I can't say without knowing more about the format, but hopefully you understand what I'm doing well enough to verify that it's correct (or at least to ask smarter questions than I can ask).
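As a quick check, the looser pattern recovers the full name from the problem line when combined with the question's replacement string:
import re

line = '> AY538167 1 1411 1411bp rna Acholeplasma hippikon Acholeplasmataceae'
regex = re.compile(r'(>)\s(\w+).+\s+rna\s+([A-Z].+)')
print(regex.sub(r'\1\2 \3', line))
# >AY538167 Acholeplasma hippikon Acholeplasmataceae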
It may help debug things to add capture groups. For example, instead of this:
(>)\s(\w+).+[rna]\s+([A-Z].+)
… search for this:
(>)(\s)(\w+)(.+[rna]\s+)([A-Z].+)
Obviously your desired capture groups are now \1\3 \5 instead of \1\2 \3… but the big thing is that you can see what got matched in \4:
[('>', ' ', 'AY538167', ' 1 1411 1411bp Acholeplasma hippikon ', 'Acholeplasmataceae')]
So, now the question is: why did .+[rna]\s+ match '1 1411 1411bp Acholeplasma hippikon '? Sometimes the context matters, but in this case, it doesn't. You don't want that group to match that string in any context, and yet it will always match it, so that's the part you have to debug.
Also, a visual regexp explorer often helps a lot. The best ones can color parts of the expression and the matched text, etc., to show you how and why the regexp is doing what it does.
Of course you're limited by those that run on your platform or online, and work with Python syntax. If you're careful and/or only use simple features (as in your example), perl/PCRE syntax is very close to Python, and JavaScript/ActionScript is also pretty close (the one big difference to keep in mind is that replace/sub uses $ instead of \1).
I don't have a good online one to strongly recommend, but from a quick glance Debuggex looks pretty cool.
Items between brackets form a character class, so by writing "[rna]" you are asking for a single character that is r, n, or a, not the literal string rna.
Further, if the lines you want all contain the pattern "bp rna", I'd use that to yank those lines. Reading the file line by line, the following worked for me as a quick and dirty line-yanker, for instance:
regex = re.compile(r'^[\w\s]+bp rna .*$')
But, again, if it's as simple as finding lines with "bp rna" in them, you could read the file line by line and forego regex entirely:
for line in file:
    if "bp rna" in line:
        print(line)
EDIT: I blew it by not reading the request carefully enough. Maybe a capture-and-replace regex would help?
import re

for line in file:
    if "bp rna" in line:
        subreg = re.sub(r'^(>[\w]+)\s[\d\s]+bp\srna\s([\w\s]+$)', r"\1 \2", line)
        print(subreg)
OUTPUT:
>AY538166 Acholeplasma granularum Acholeplasmataceae
>AY538167 Acholeplasma hippikon Acholeplasmataceae
This should match any whitespace (tabs or spaces) between the things you want.

Python - substitute a substring with one character less

I need to process lines having a syntax similar to markdown http://daringfireball.net/projects/markdown/syntax, where header lines in my case are something like:
=== a sample header ===
===== a deeper header =====
and I need to change their depth, i.e. reduce it (or increase it) so:
== a sample header ==
==== a deeper header ====
My small knowledge of Python regexes is not enough to understand how to replace a number n of '=' signs with (n-1) '=' signs.
You could use backreferences and two negative lookarounds to find two corresponding sets of = characters.
output = re.sub(r'(?<!=)=(=+)(.*?)=\1(?!=)', r'\1\2\1', input)
That will also work if you have a longer string that contains multiple headers (and will change all of them).
What does the regex do?
(?<!=) # make sure there is no preceding =
= # match a literal =
( # start capturing group 1
=+ # match one or more =
) # end capturing group 1
( # start capturing group 2
.*? # match zero or more characters, but as few as possible (due to ?)
) # end capturing group 2
= # match a =
\1 # match exactly what was matched with group 1 (i.e. the same amount of =)
(?!=) # make sure there is no trailing =
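To see it in action on the sample headers (both are rewritten in one pass, since the sub is applied across the whole string):
import re

text = "=== a sample header ===\n===== a deeper header ====="
print(re.sub(r'(?<!=)=(=+)(.*?)=\1(?!=)', r'\1\2\1', text))
# == a sample header ==
# ==== a deeper header ====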
No need for regexes. I would go very simple and direct:
import sys

for line in sys.stdin:
    trimmed = line.strip()
    if len(trimmed) >= 2 and trimmed[0] == '=' and trimmed[-1] == '=':
        print(trimmed[1:-1])
    else:
        print(line.rstrip())
The initial strip is useful because in Markdown people sometimes leave blank spaces at the end of a line (and maybe the beginning). Adjust accordingly to meet your requirements.
I think it can be as simple as replacing '=(=+)' with \1.
Is there any reason not to do so?
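A quick check of that suggestion: =(=+) matches each run of two or more '=' signs and keeps everything but the first one, so both runs shrink by exactly one:
import re

for line in ['=== a sample header ===', '===== a deeper header =====']:
    print(re.sub(r'=(=+)', r'\1', line))
# == a sample header ==
# ==== a deeper header ====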
How about a simple solution?
lines = ['=== a sample header ===', '===== a deeper header =====']
new_lines = []
for line in lines:
    if line.startswith('==') and line.endswith('=='):
        new_lines.append(line[1:-1])
results:
['== a sample header ==', '==== a deeper header ====']
or in one line:
new_lines = [line[1:-1] for line in lines if line.startswith('==') and line.endswith('==')]
The logic here is that if a line starts and ends with '==' then it must have at least two '=' on each side, so when we trim one character from each side, we are left with at least one '=' on each side.
This will work as long as each 'line' starts and ends with its '==....' run; if you are using these as headers, they will, as long as you strip the newlines off.
For either the first header or the second header, you can just use string replace like this:
s = "=== a sample header ==="
s.replace("= "," ")
s.replace(" ="," ")
You can also deal with the second header in the same way.
BTW: you can also use the sub function of the re module, but it's not necessary.
