How to grab the number after a word in Python

I have a huge file containing lines such as DDD-1126N|refseq:NP_285726|uniprotkb:P00112 and DDD-1081N|uniprotkb:P12121. I want to grab the number after uniprotkb.
Here's my code:
x = 'uniprotkb:P'
f = open('m.txt')
for line in f:
    print(line.find(x))
    print(line[36:31 + len(x)])
The problem is that line.find(x) returns 10 or 26 depending on the line, and I only grab the complete number when it is 26. I'm new to programming, so I'm looking for a way to grab the complete number after the word, something like:
x = 'uniprotkb:'
f = open('m.txt')
for line in f:
    if x in line:
        print the number after x

Use regular expressions:
import re
for line in open('m.txt'):
    # raw string so that \d reaches the regex engine unchanged
    match = re.search(r'uniprotkb:P(\d+)', line)
    if match:
        print(match.group(1))

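import re
# + instead of * so that a bare "uniprotkb:P" does not yield an empty match
regex = re.compile(r'uniprotkb:P([0-9]+)')
# string is the text to search, here the whole contents of the file
string = open('m.txt').read()
print(regex.findall(string))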

The re module is quite unnecessary here if x is static and always matches a substring at the end of each line (like "DDD-1126N|refseq:NP_285726|uniprotkb:P00112"):
x = 'uniprotkb:'
f = open('m.txt')
for line in f:
  if x in line:
    print(line[line.find(x)+len(x):].rstrip('\n'))
Edit:
To answer your comment: if they are separated by the pipe character (|), then you could do this:
sep = "|"
x = 'uniprotkb:'
f = open('m.txt')
for line in f:
  if x in line:
    # keep only the fields that start with x, and drop the prefix
    matches = [l[len(x):] for l in line.strip().split(sep) if l.startswith(x)]
    print(matches)
If m.txt has the following line:
DDD-1126N|uniprotkb:285726|uniprotkb:P00112
Then the above will output:
['285726', 'P00112']
Replace sep = "|" with whatever the column separator would be.
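A variation on the same idea, as a rough Python 3 sketch: split each line into fields, then split each field at its first colon with str.partition. The helper names parse_fields and values_for are made up for this illustration and are not part of the answer above.

```python
def parse_fields(line, sep="|"):
    """Split a line into (key, value) pairs; fields without ':' get an empty key."""
    pairs = []
    for field in line.strip().split(sep):
        key, colon, value = field.partition(":")
        if colon:
            pairs.append((key, value))
        else:
            # no colon in this field, e.g. the leading "DDD-1126N"
            pairs.append(("", field))
    return pairs


def values_for(line, key, sep="|"):
    """Return every value whose key equals `key` (hypothetical helper)."""
    return [value for k, value in parse_fields(line, sep) if k == key]


sample = "DDD-1126N|refseq:NP_285726|uniprotkb:P00112\n"
print(values_for(sample, "uniprotkb"))  # ['P00112']
```

Comparing whole keys rather than searching for a substring means a field such as refseq:NP_285726 can never be mistaken for a uniprotkb entry.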

Um, for one thing I'd suggest you use the csv module to read a TSV file.
But generally, you can use a regular expression:
import re
regex = re.compile(r"(?<=\buniprotkb:)\w+")
f = open('m.txt')
for line in f:
    match = regex.search(line)
    if match:
        print(match.group())
The regular expression matches a string of alphanumeric characters if it's preceded by uniprotkb:.
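To see the lookbehind at work, here is a small Python 3 sketch that runs the same pattern over the two sample lines from the question. The function name uniprot_ids is an assumption made for this example; findall is used instead of search so that a line with several uniprotkb fields returns all of them.

```python
import re

# Lookbehind: match word characters only when "uniprotkb:" comes right before them.
# The \b keeps the pattern from firing inside a longer word such as "xuniprotkb:".
UNIPROT_ID = re.compile(r"(?<=\buniprotkb:)\w+")


def uniprot_ids(line):
    """Return every identifier that follows 'uniprotkb:' in `line`."""
    return UNIPROT_ID.findall(line)


lines = [
    "DDD-1126N|refseq:NP_285726|uniprotkb:P00112",
    "DDD-1081N|uniprotkb:P12121",
]
for line in lines:
    print(uniprot_ids(line))
```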

Related

Stripping numbers dates until first alphabet is found from string

I am looking for an efficient way to strip numbers, dates, or any other characters from the end of a string until the first letter is found.
string - '12.abd23yahoo 04/44 231'
Output - '12.abd23yahoo'
line_inp = "12.abd23yahoo 04/44 231"
line_out = line_inp.rstrip('0123456789./')
This rstrip() call doesn't seem to work as expected; I get '12.abd23yahoo 04/44 ' instead.
I am trying the following, but it doesn't seem to work:
for fname in filenames:
    with open(fname) as infile:
        for line in infile:
            outfile.write(line.rstrip('0123456789./ '))
You need to strip spaces too:
line_out = line_inp.rstrip('0123456789./ ')
Demo:
>>> line_inp = "12.abd23yahoo 04/44 231"
>>> line_inp.rstrip('0123456789./ ')
'12.abd23yahoo'
You need to strip the newline as well and add it back before you write:
for fname in filenames:
    with open(fname) as infile:
        outfile.writelines(line.rstrip('0123456789./ \n') + "\n"
                           for line in infile)
If the format is always the same you can just split:
with open(fname) as infile:
    outfile.writelines(line.split(None, 1)[0] + "\n"
                       for line in infile)
Here's a solution using a regular expression:
import re
line_inp = "12.abd23yahoo 04/44 231"
r = re.compile('^(.*[a-zA-Z])')
m = re.match(r, line_inp)
line_out = m.group(0) # 12.abd23yahoo
The regular expression matches a group of arbitrary characters which end in a letter.
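The same result can be reached from the other direction, by deleting the trailing run of non-letters instead of capturing everything up to the last letter. This is a Python 3 sketch of that swapped-in approach, assuming "letter" means ASCII a-z and A-Z; the name strip_after_last_letter is made up for the example.

```python
import re

# One or more non-letters anchored at the end of the string.
TRAILING_NON_LETTERS = re.compile(r"[^A-Za-z]+$")


def strip_after_last_letter(text):
    """Drop everything after the last ASCII letter in `text`."""
    return TRAILING_NON_LETTERS.sub("", text)


print(strip_after_last_letter("12.abd23yahoo 04/44 231"))  # 12.abd23yahoo
```

Unlike the re.match version, this returns an empty string rather than failing when the input contains no letters at all, and it also removes a trailing newline, so add one back before writing lines to a file.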

Python - Replace parenthesis with periods and remove first and last period

I am trying to take an input file with a list of DNS lookups in which the subdomain/domain separators are string lengths in parentheses rather than periods. It looks like this:
(8)subdomain(5)domain(3)com(0)
(8)subdomain(5)domain(3)com(0)
(8)subdomain(5)domain(3)com(0)
I would like to replace the parentheses and numbers with periods and then remove the first and last period. My code currently does this, but leaves the last period. Any help is appreciated. Here is the code:
import re
file = open('test.txt', 'rb')
writer = open('outfile.txt', 'wb')
for line in file:
    newline1 = re.sub(r"\(\d+\)",".",line)
    if newline1.startswith('.'):
        newline1 = newline1[1:-1]
    writer.write(newline1)
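You can split the lines with the \(\d+\) regex and then join with ., stripping the periods at both ends:
for line in file:
    # strip the newline first, otherwise the trailing period is not at the end
    res = ".".join(re.split(r'\(\d+\)', line.strip()))
    writer.write(res.strip('.') + "\n")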
Given that your re.sub call works like this:
> re.sub(r"\(\d+\)",".", "(8)subdomain(5)domain(3)com(0)")
'.subdomain.domain.com.'
the only thing you need to do is strip the resulting string from any leading and trailing .:
> s = re.sub(r"\(\d+\)",".", "(8)subdomain(5)domain(3)com(0)")
> s.strip(".")
'subdomain.domain.com'
Full drop-in solution:
for line in file:
    # strip the newline together with the periods, then add it back
    newline1 = re.sub(r"\(\d+\)",".",line).strip(".\n")
    writer.write(newline1 + "\n")
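Lines read from a file normally end in a newline, so the final period is not the last character of the line, and that is why code that only trims the ends can leave it behind. A minimal Python 3 sketch of the substitute-then-strip approach with that detail handled follows; the function name to_hostname is an assumption for this example.

```python
import re

# A length marker such as "(8)" or "(0)".
LENGTH_MARKER = re.compile(r"\(\d+\)")


def to_hostname(line):
    """Turn '(8)subdomain(5)domain(3)com(0)' into 'subdomain.domain.com'."""
    # Strip whitespace first so a trailing newline cannot shield the last period.
    dotted = LENGTH_MARKER.sub(".", line.strip())
    return dotted.strip(".")


print(to_hostname("(8)subdomain(5)domain(3)com(0)\n"))  # subdomain.domain.com
```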
import re
def repl(matchobj):
    if matchobj.group(1):
        return "."
    else:
        return ""
x="(8)subdomain(5)domain(3)com(0)"
print(re.sub(r"^\(\d+\)|((?<!^)\(\d+\))(?!$)|\(\d+\)$",repl,x))
Output: subdomain.domain.com
You can define your own replace function.
import re
for line in file:
    # \d+ so that multi-digit lengths such as (11) are replaced too
    line = re.sub(r'\(\d+\)','.',line.strip())
    line = line.strip('.')
    writer.write(line + '\n')

printing dates in a log file

I have a log file with a consistently formatted date on each line.
ex:
date1
date2
...
The number of dates corresponds to the number of entries in my log file. I was wondering how I can print the dates from the log file using regular expressions.
What I have tried:
import re
dateRegex = re.compile('^\w{3}\s\d\d:\d\d:\d\d')
f = open("logfile.log","r")
for line in f.readlines():
    matches = re.findall(dateRegex,line)
    print(matches)
The output I am getting is (many []):
[]
[]
[]
...
...
You seem to have forgotten the date:
import re
dateRegex = re.compile(r'^\w{3}\s\d\d?\s\d\d:\d\d:\d\d')
# \d\d?\s is the added day of the month; the ? caters for days between 1 & 9
f = open("logfile.log","r")
for line in f.readlines():
    matches = re.findall(dateRegex,line)
    if matches: # Check if there are matches
        print(matches[0]) # Print first element of list returned by findall
I think that you can use re.match instead, since you're testing line by line and using the beginning of line anchor:
import re
dateRegex = re.compile(r'\w{3}\s\d\d?\s\d\d:\d\d:\d\d')
f = open("logfile.log","r")
for line in f.readlines():
    matches = re.match(dateRegex,line)
    if matches:
        print(matches.group())
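The question does not show real log lines, so the sketch below (Python 3) assumes syslog-style timestamps such as "Jan  5 09:15:02", in which single-digit days are padded with an extra space. That is why it uses \s+ after the month, which is a small departure from the patterns above. The sample lines and the name leading_date are invented for the illustration.

```python
import re

# Month abbreviation, day of month (1 or 2 digits), then HH:MM:SS.
# \s+ tolerates the double space syslog puts before single-digit days.
DATE_RE = re.compile(r"\w{3}\s+\d{1,2}\s\d\d:\d\d:\d\d")


def leading_date(line):
    """Return the timestamp at the start of `line`, or None if there is none."""
    m = DATE_RE.match(line)  # match() only looks at the start of the string
    return m.group() if m else None


# Hypothetical log lines, not taken from the question.
sample_log = [
    "Jan  5 09:15:02 host app: started",
    "Jan 12 10:00:41 host app: stopped",
    "continuation line without a date",
]
for line in sample_log:
    date = leading_date(line)
    if date is not None:
        print(date)
```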

How to eliminate last digit from each of the top lines

>Sequence 1.1.1 ATGCGCGCGATAAGGCGCTA
ATATTATAGCGCGCGCGCGGATATATATATATATATATATT
>Sequence 1.2.2 ATATGCGCGCGCGCGCGGCG
ACCCCGCGCGCGCGCGGCGCGATATATATATATATATATATT
>Sequence 2.1.1 ATTCGCGCGAGTATAGCGGCG
Now, I would like to remove the last digit from each line that starts with '>'. For example, in the first line I would like to remove '.1' (the rightmost one), and in the second instance I would like to remove '.2', and then write the rest of the file to a new file. Thanks,
import fileinput
import re
for line in fileinput.input(inplace=True, backup='.bak'):
    line = line.rstrip()
    if line.startswith('>'):
        line = re.sub(r'\.\d$', '', line)
    print(line)
Many details can be changed depending on the processing you want, which you have not clearly communicated, but this is the general idea.
import re
# text holds the contents of the file; Python backreferences are written \1, not $1
trimmedtext = re.sub(r'(\d+\.\d+)\.\d', r'\1', text)
Should do it. Somewhat simpler than searching for start characters (and it won't affect your DNA chains).
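As a quick check of the backreference approach, here is a Python 3 sketch run on a shortened version of the sample data. The function name drop_last_component is made up for the example, and \d+ is used for the last component so that multi-digit suffixes are removed as well.

```python
import re

# Two dotted numbers (kept as group 1) followed by a third one (dropped).
VERSION_TAIL = re.compile(r"(\d+\.\d+)\.\d+")


def drop_last_component(text):
    """Remove the last of three dotted numbers, e.g. 1.2.2 becomes 1.2."""
    # r"\1" is Python's backreference syntax for the first group.
    return VERSION_TAIL.sub(r"\1", text)


fasta = (
    ">Sequence 1.1.1 ATGCGCGCGATAAGGCGCTA\n"
    "ATATTATAGCGCGCG\n"
    ">Sequence 1.2.2 ATATGCG\n"
)
print(drop_last_component(fasta))
```

The sequence lines contain no digits, so they pass through unchanged.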
if line.startswith('>Sequence'):
    line = line[:-2] # trim 2 characters from the end of the string
or if there could be more than one digit after the period:
if line.startswith('>Sequence'):
    dot_pos = line.rfind('.') # find position of rightmost period
    line = line[:dot_pos] # truncate up to but not including the dot
Edit for if the sequence occurs on the same line as >Sequence
If we know that there will always be only 1 digit to remove we can cut out the period and the digit with:
line = line[:13] + line[15:]
This is using a feature of Python called slices. The indexes are zero-based and exclusive for the end of the range so line[0:13] will give us the first 13 characters of line. Except that if we want to start at the beginning the 0 is optional so line[:13] does the same thing. Similarly line[15:] gives us the substring starting at character 15 to the end of the string.
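The slice arithmetic can be checked on one header line. This Python 3 sketch assumes the header starts with '>' as described in the question, so that '>Sequence 1.1' occupies exactly the first 13 characters.

```python
line = ">Sequence 1.1.1 ATGCGCGCGATAAGGCGCTA"

head = line[:13]       # characters 0 to 12: everything before the last ".1"
removed = line[13:15]  # characters 13 and 14: the period and digit to drop
tail = line[15:]       # character 15 to the end: the space and the sequence

print(repr(head))     # '>Sequence 1.1'
print(repr(removed))  # '.1'
print(head + tail)    # >Sequence 1.1 ATGCGCGCGATAAGGCGCTA
```

The fixed offsets only work while the header has exactly this shape; a two-digit number anywhere in it shifts the positions.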
Map ".".join(line.split('.')[:-1]) over each header line of the file.
Here's a short script. Run it like: script [filename to clean]. Lots of error handling omitted.
It operates using generators, so it should work fine on huge files as well.
import sys
import os
def clean_line(line):
    # header lines lose their last two characters (the period and the digit)
    if line.startswith(">"):
        return line.rstrip()[:-2]
    else:
        return line.rstrip()
def clean(input):
    # generator: yields one cleaned line at a time
    for line in input:
        yield clean_line(line)
if __name__ == "__main__":
    filename = sys.argv[1]
    print("Cleaning %s; output to %s.." % (filename, filename + ".clean"))
    input = None
    output = None
    try:
        input = open(filename, "r")
        output = open(filename + ".clean", "w")
        for line in clean(input):
            output.write(line + os.linesep)
            print(": " + line)
    finally:
        # close whichever files were opened, whether or not an error occurred
        if input != None:
            input.close()
        if output != None:
            output.close()
import re
input_file = open('in')
output_file = open('out', 'w')
for line in input_file:
    line = re.sub(r'(\d+[.]\d+)[.]\d+', r'\1', line)
    output_file.write(line)
