Replacement for isAlpha() to include underscores? - python

I am processing data using Python3 and I need to read a results file that looks like this:
ENERGY_BOUNDS
1.964033E+07 1.733253E+07 1.491825E+07 1.384031E+07 1.161834E+07 1.000000E+07 8.187308E+06 6.703200E+06
6.065307E+06 5.488116E+06 4.493290E+06 3.678794E+06 3.011942E+06 2.465970E+06 2.231302E+06 2.018965E+06
GAMMA_INTERFACE
0
EIGENVALUE
1.219034E+00
I want to search the file for a specific identifier (in this case ENERGY_BOUNDS), begin reading the numeric values after this identifier but not the identifier itself, and stop when I reach the next identifier. However, my problem is that I was using isAlpha to find the next identifier, and some of them contain underscores. Here is my code:
def read_data_from_file(file_name, identifier):
with open(file_name, 'r') as read_obj:
list_of_results = []
# Read all lines in the file one by one
for line in read_obj:
# For each line, check if line contains the string
if identifier in line:
# If yes, read the next line
nextValue = next(read_obj)
while(not nextValue.strip().isalpha()): # Keep on reading until next identifier appears
list_of_results.extend(nextValue.split())
nextValue = next(read_obj)
return(list_of_results)
I think I need to use regex, but I am stuck regarding how to phrase it. Any help would be much appreciated!

take = False
with open('path/to/input') as infile:
for line in input:
if line.strip() == "ENERGY_BOUNDS":
take = True
continue # we don't actually want this line
if all(char.isalpha() or char=="_" for char in line.strip()): # we've hit the next section
take = False
if take:
print(line) # or whatever else you want to do with this line

Here's an option for you.
Just iterate over the file until you hit the identifier.
Then iterate over it in another for loop until the next identifier causes a ValueError.
def read_data_from_file(file_name, identifier):
with open(file_name, 'r') as f:
list_of_results = []
for line in f:
if identifier in line:
break
for line in f:
try:
list_of_results.extend(map(float, line.split()))
except ValueError:
break
return list_of_results

You can use this regex: ^[A-Z]+(?:_[A-Z]+)*$
Additionally, you can modify the regex to match strings of custom length, like this: ^[A-Z]{2,10}+(?:_[A-Z]+)*$, where {2, 10} is {MIN, MAX} length:
You can find this demo here: https://regex101.com/r/9jESAH/35
See this answer for more details.

Here is a simple function to verify a string has alpha, uppercase and lowercase and underscore:
RE_PY_VAR_NAME="^[a-zA-Z_]+$"
def isAlphaUscore(s:str) -> bool:
assert not s is None, "s cannot be None"
return re.search(RE_PY_VAR_NAME, s)

Related

Multiple strings replacement from dictionary

I'm going to create script which will take the key and value from the dictionary and use it for replacement in set of files.
I need replace "foo" to "foo1" if "target_value" found in the file. There are many different foo's. So, I guess dictionary will be suitable for that.
I started from the simple things:
with fileinput.FileInput(filelist, inplace=True) as file:
for line in file:
if line.find('target_value'):
print(line.replace("foo", "foo1"), end='')
For some reason this script just ignore line.find and replace everything with last line of code.
Could you help?
Python's find command returns -1 if the value is not found, so you need something like:
with fileinput.FileInput(filelist, inplace=True) as file:
for line in file:
if line.find('target_value') > -1:
line = line.replace("foo", "foo1")
print(line, end='')
The trouble with str.find is that it returns the index at which "target_value" occurs in line, so any integer from 0 to len(line)-len(target_value)-1. That is, unless, "target_value" doesn't exist in line; then str.find returns -1 but the value of bool(-1) is True. In fact, the only time line.find('target_value') is False is when "target_value" is the first part of line.
There are a couple options:
with fileinput.FileInput(filelist, inplace=True) as file:
for line in file:
if line.find('target_value') != -1:
print(line.replace("foo", "foo1"), end='')
Or:
with fileinput.FileInput(filelist, inplace=True) as file:
for line in file:
if 'target_value' in line:
print(line.replace("foo", "foo1"), end='')
The second option is more readable and tends to perform better when line is long and "target_value" doesn't occur in the beginning of line.
>>> timeit('"target_value" in s', setup='s = "".join("foobar baz" for _ in range(100))+"target_value"+ "".join("foobar baz" for _ in range(100))')
0.20444475099975534
>>> timeit('s.find("target_value") != -1', setup='s = "".join("foobar baz" for _ in range(100))+"target_value"+ "".join("foobar baz" for _ in range(100))')
0.30517548999978317
Instead of .find(), you can use if 'target_value' in line: which is more expressive and does not involve return value conventions.
If you have multiple target keywords (and multiple replacements per target), you could build your dictionary in like this
replacements = { 'target_value1': [('foo','foo1'), ('bar','bar1')],
'target_value2': [('oof','oof'), ('rab','rab2')],
'target_value3': [('foobar','foosy'),('barfoo','barry')]}
Then find which target value(s) are present and perform the corresponding replacements:
with open(fileName,'r') as f:
content = f.read()
for target,maps in replacements.items(): # go through target keywords
if target not in content: continue # check target (whole file)
for fromString,toString in maps: # perform replacements for target
content = content.replace(fromString,toString)
# you probably want to save the new content at this point
with open(fileName,'w') as f:
f.write(content)
Note that, in this example, I assumed that the target keyword flags the whole file (not each line). If the target keyword is specific to each line, you will need to break down the content into lines and place the logic inside a loop on lines to perform the replacements on a line by line basis
You don't actually need the replacement data to be a dictionary (this second example with line level single target replacements):
replacements = [ ['target_value1',('foo','foo1'), ('bar','bar1')],
['target_value2',('oof','oof'), ('rab','rab2')],
['target_value3',('foobar','foosy'),('barfoo','barry')]]
with open(fileName,'r') as f:
lines = f.read().split("\n")
for i,line in enumerate(lines): # for each line
for target,*maps in replacements: # go through target keywords
if target not in line: continue # check target (line level)
for fromString,toString in maps: # perform replacements on line
lines[i] = lines[i].replace(fromString,toString)
break # only process one keyword per line
# to save changes
with open(fileName,'w') as f:
f.write("\n".join(lines))

How to read a value of file separate by tabs in Python?

I have a text file with this format
ConfigFile 1.1
;
; Version: 4.0.32.1
; Date="2021/04/08" Time="11:54:46" UTC="8"
;
Name
John Legend
Type
Student
Number
s1054520
I would like to get the value of Name or Type or Number
How do I get it?
I tried with this method, but it does not solve my problem.
import re
f = open("Data.txt", "r")
file = f.read()
Name = re.findall("Name", file)
print(Name)
My expectation output is John Legend
Anyone can help me please. I really appreciated. Thank you
First of all re.findall is used to search for “all” occurrences that match a given pattern. So in your case. you are finding every "Name" in the file. Because that's what you are looking for.
On the other hand, the computer will not know the "John Legend" is the name. it will only know that's the line after the word "Name".
In your case I will suggest you can check this link.
Find the "Name"'s line number
Read the next line
Get the name without the white space
If there is more than 1 Name. this will work as well
the final code is like this
def search_string_in_file(file_name, string_to_search):
"""Search for the given string in file and return lines containing that string,
along with line numbers"""
line_number = 0
list_of_results = []
# Open the file in read only mode
with open(file_name, 'r') as read_obj:
# Read all lines in the file one by one
for line in read_obj:
# For each line, check if line contains the string
line_number += 1
if string_to_search in line:
# If yes, then add the line number & line as a tuple in the list
list_of_results.append((line_number, line.rstrip()))
# Return list of tuples containing line numbers and lines where string is found
return list_of_results
file = open('Data.txt')
content = file.readlines()
matched_lines = search_string_in_file('Data.txt', 'Name')
print('Total Matched lines : ', len(matched_lines))
for i in matched_lines:
print(content[i[0]].strip())
Here I'm going through each line and when I encounter Name I will add the next line (you can directly print too) to the result list:
import re
def print_hi(name):
result = []
regexp = re.compile(r'Name*')
gotname = False;
with open('test.txt') as f:
for line in f:
if gotname:
result.append(line.strip())
gotname = False
match = regexp.match(line)
if match:
gotname = True
print(result)
if __name__ == '__main__':
print_hi('test')
Assuming those label lines are in the sequence found in the file you
can simply scan for them:
labelList = ["Name","Type","Number"]
captures = dict()
with open("Data.txt","rt") as f:
for label in labelList:
while not f.readline().startswith(label):
pass
captures[label] = f.readline().strip()
for label in labelList:
print(f"{label} : {captures[label]}")
I wouldn't use a regex, but rather make a parser for the file type. The rules might be:
The first line can be ignored
Any lines that start with ; can be ignored.
Every line with no leading whitespace is a key
Every line with leading whitespace is a value belonging to the last
key
I'd start with a generator that can return to you any unignored line:
def read_data_lines(filename):
with open(filename, "r") as f:
# skip the first line
f.readline()
# read until no more lines
while line := f.readline():
# skip lines that start with ;
if not line.startswith(";"):
yield line
Then fill up a dict by following rules 3 and 4:
def parse_data_file(filename):
data = {}
key = None
for line in read_data_lines(filename):
# No starting whitespace makes this a key
if not line.startswith(" "):
key = line.strip()
# Starting whitespace makes this a value for the last key
else:
data[key] = line.strip()
return data
Now at this point you can parse the file and print whatever key you want:
data = parse_data_file("Data.txt")
print(data["Name"])

Return value in a quite nested for-loop

I want nested loops to test whether all elements match the condition and then to return True. Example:
There's a given text file: file.txt, which includes lines of this pattern:
aaa:bb3:3
fff:cc3:4
Letters, colon, alphanumeric, colon, integer, newline.
Generally, I want to test whether all lines matches this pattern. However, in this function I would like to check whether the first column includes only letters.
def opener(file):
#Opens a file and creates a list of lines
fi=open(file).read().splitlines()
import string
res = True
for i in fi:
#Checks whether any characters in the first column is not a letter
if any(j not in string.ascii_letters for j in i.split(':')[0]):
res = False
else:
continue
return res
However, the function returns False even if all characters in the first column are letters. I would like to ask you for the explanation, too.
Your code evaluates the empty line after your code - hence False :
Your file contains a newline after its last line, hence your code checks the line after your last data which does not fullfill your test- that is why you get False no matter the input:
aaa:bb3:3
fff:cc3:4
empty line that does not start with only letters
You can fix it if you "spezial treat" empty lines if they occur at the end. If you have an empty line in between filled ones you return False as well:
with open("t.txt","w") as f:
f.write("""aaa:bb3:3
fff:cc3:4
""")
import string
def opener(file):
letters = string.ascii_letters
# Opens a file and creates a list of lines
with open(file) as fi:
res = True
empty_line_found = False
for i in fi:
if i.strip(): # only check line if not empty
if empty_line_found: # we had an empty line and now a filled line: error
return False
#Checks whether any characters in the first column is not a letter
if any(j not in letters for j in i.strip().split(':')[0]):
return False # immediately exit - no need to test the rest of the file
else:
empty_line_found = True
return res # or True
print (opener("t.txt"))
Output:
True
If you use
# example with a file that contains an empty line between data lines - NOT ok
with open("t.txt","w") as f:
f.write("""aaa:bb3:3
fff:cc3:4
""")
or
# example for file that contains empty line after data - which is ok
with open("t.txt","w") as f:
f.write("""aaa:bb3:3
ff2f:cc3:4
""")
you get: False
Colonoscopy
ASCII, and UNICODE, both define character 0x3A as COLON. This character looks like two dots, one over the other: :
ASCII, and UNICODE, both define character 0x3B as SEMICOLON. This character looks like a dot over a comma: ;
You were consistent in your use of the colon in your example: fff:cc3:4 and you were consistent in your use of the word semicolon in your descriptive text: Letters, semicolon, alphanumeric, semicolon, integer, newline.
I'm going to assume you meant colon (':') since that is the character you typed. If not, you should change it to a semicolon (';') everywhere necessary.
Your Code
Here is your code, for reference:
def opener(file):
#Opens a file and creates a list of lines
fi=open(file).read().splitlines()
import string
res = True
for i in fi:
#Checks whether any characters in the first column is not a letter
if any(j not in string.ascii_letters for j in i.split(':')[0]):
res = False
else:
continue
return res
Your Problem
The problem you asked about was the function always returning false. The example you gave included a blank line between the first example and the second. I would caution you to watch out for spaces or tabs in those blank lines. You can fix this by explicitly catching blank lines and skipping over them:
for i in fi:
if i.isspace():
# skip blank lines
continue
Some Other Problems
Now here are some other things you might not have noticed:
You provided a nice comment in your function. That should have been a docstring:
def opener(file):
""" Opens a file and creates a list of lines.
"""
You import string in the middle of your function. Don't do that. Move the import
up to the top of the module:
import string # at top of file
def opener(file): # Not at top of file
You opened the file with open() and never closed it. This is exactly why the with keyword was added to python:
with open(file) as infile:
fi = infile.read().splitlines()
You opened the file, read its entire contents into memory, then split it into lines
discarding the newlines at the end. All so that you could split it by colons and ignore
everything but the first field.
It would have been simpler to just call readlines() on the file:
with open(file) as infile:
fi = infile.readlines()
res = True
for i in fi:
It would have been even easier and even simpler to just iterate on the file directly:
with open(file) as infile:
res = True
for i in infile:
It seems like you are building up towards checking the entire format you gave at the beginning. I suspect a regular expression would be (1) easier to write and maintain; (2) easier to understand later; and (3) faster to execute. Both now, for this simple case, and later when you have more rules in place:
import logging
import re
bad_lines = 0
for line in infile:
if line.isspace():
continue
if re.match(valid_line, line):
continue
logging.warn(f"Bad line: {line}")
bad_lines += 1
return bad_lines == 0
Your names are bad. Your function includes the names file, fi, i, j, and res. The only one that barely makes sense is file.
Considering that you are asking people to read your code and help you find a problem, please, please use better names. If you just replaced those names with file (same), infile, line, ch, and result the code gets more readable. If you restructured the code using standard Python best practices, like with, it gets even more readable. (And has fewer bugs!)

Match an element of every line

I have a list of rules for a given input file for my function. If any of them are violated in the file given, I want my program to return an error message and quit.
Every gene in the file should be on the same chromosome
Thus for a lines such as:
NM_001003443 chr11 + 5997152 5927598 5921052 5926098 1 5928752,5925972, 5927204,5396098,
NM_001003444 chr11 + 5925152 5926098 5925152 5926098 2 5925152,5925652, 5925404,5926098,
NM_001003489 chr11 + 5925145 5926093 5925115 5926045 4 5925151,5925762, 5987404,5908098,
etc.
Each line in the file will be variations of this line
Thus, I want to make sure every line in the file is on chr11
Yet I may be given a file with a different list of chr(and any number of numbers). Thus I want to write a function that will make sure whatever number is found on chr in the line is the same for every line.
Should I use a regular expression for this, or what should I do? This is in python by the way.
Such as: chr\d+ ?
I am unsure how to make sure that whatever is matched is the same in every line though...
I currently have:
from re import *
for line in file:
r = 'chr\d+'
i = search(r, line)
if i in line:
but I don't know how to make sure it is the same in every line...
In reference to sajattack's answer
fp = open(infile, 'r')
for line in fp:
filestring = ''
filestring +=line
chrlist = search('chr\d+', filestring)
chrlist = chrlist.group()
for chr in chrlist:
if chr != chrlist[0]:
print('Every gene in file not on same chromosome')
Just read the file and have a while loop check each line to make sure it contains chr11. There are string functions to search for substrings in a string. As soon as you find a line that returns false (does not contain chr11) then break out of the loop and set a flag valid = false.
import re
fp = open(infile, 'r')
fp.readline()
tar = re.findall(r'chr\d+', fp.readline())[0]
for line in fp:
if (line.find(tar) == -1):
print("Not valid")
break
This should search for a number in the line and check for validity.
Is it safe to assume that the first chr is the correct one? If so, use this:
import re
chrlist = re.findall("chr[0-9]+", open('file').read())
# ^ this is a list with all chr(whatever numbers)
for chr in chrlist:
if chr != chrlist[0]
print("Chr does not match")
break
My solution uses a "match group" to collect the matched numbers from the "chr" string.
import re
pat = re.compile(r'\schr(\d+)\s')
def chr_val(line):
m = re.search(pat, line)
if m is not None:
return m.group(1)
else:
return ''
def is_valid(f):
line = f.readline()
v = chr_val(line)
if not v:
return False
return all(chr_val(line) == v for line in f)
with open("test.txt", "r") as f:
print("The file is {0}".format("valid" if is_valid(f) else "NOT valid"))
Notes:
Pre-compiles the regular expression for speed.
Uses a raw string (r'') to specify the regular expression.
The pattern requires white space (\s) on either side of the chr string.
is_valid() returns False if the first line doesn't have a good chr value. Then it returns a Boolean value that is true if all of the following lines match the chr value of the first line.
Your sample code just prints something like The file is True so I made it a bit friendlier.

How can I use readline() to begin from the second line?

I'm writing a short program in Python that will read a FASTA file which is usually in this format:
>gi|253795547|ref|NC_012960.1| Candidatus Hodgkinia cicadicola Dsem chromosome, 52 lines
GACGGCTTGTTTGCGTGCGACGAGTTTAGGATTGCTCTTTTGCTAAGCTTGGGGGTTGCGCCCAAAGTGA
TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC
TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA
I've created another program that reads the first line(aka header) of this FASTA file and now I want this second program to start reading and printing beginning from the sequence.
How would I do that?
so far i have this:
FASTA = open("test.txt", "r")
def readSeq(FASTA):
"""returns the DNA sequence of a FASTA file"""
for line in FASTA:
line = line.strip()
print line
readSeq(FASTA)
Thanks guys
-Noob
def readSeq(FASTA):
"""returns the DNA sequence of a FASTA file"""
_unused = FASTA.next() # skip heading record
for line in FASTA:
line = line.strip()
print line
Read the docs on file.next() to see why you should be wary of mixing file.readline() with for line in file:
you should show your script. To read from second line, something like this
f=open("file")
f.readline()
for line in f:
print line
f.close()
You might be interested in checking BioPythons handling of Fasta files (source).
def FastaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
"""Generator function to iterate over Fasta records (as SeqRecord objects).
handle - input file
alphabet - optional alphabet
title2ids - A function that, when given the title of the FASTA
file (without the beginning >), will return the id, name and
description (in that order) for the record as a tuple of strings.
If this is not given, then the entire title line will be used
as the description, and the first word as the id and name.
Note that use of title2ids matches that of Bio.Fasta.SequenceParser
but the defaults are slightly different.
"""
#Skip any text before the first record (e.g. blank lines, comments)
while True:
line = handle.readline()
if line == "" : return #Premature end of file, or just empty?
if line[0] == ">":
break
while True:
if line[0]!=">":
raise ValueError("Records in Fasta files should start with '>' character")
if title2ids:
id, name, descr = title2ids(line[1:].rstrip())
else:
descr = line[1:].rstrip()
id = descr.split()[0]
name = id
lines = []
line = handle.readline()
while True:
if not line : break
if line[0] == ">": break
#Remove trailing whitespace, and any internal spaces
#(and any embedded \r which are possible in mangled files
#when not opened in universal read lines mode)
lines.append(line.rstrip().replace(" ","").replace("\r",""))
line = handle.readline()
#Return the record and then continue...
yield SeqRecord(Seq("".join(lines), alphabet),
id = id, name = name, description = descr)
if not line : return #StopIteration
assert False, "Should not reach this line"
good to see another bioinformatician :)
just include an if clause within your for loop above the line.strip() call
def readSeq(FASTA):
for line in FASTA:
if line.startswith('>'):
continue
line = line.strip()
print(line)
A pythonic and simple way to do this would be slice notation.
>>> f = open('filename')
>>> lines = f.readlines()
>>> lines[1:]
['TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC\n', 'TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTG
TTGGGCTCGCCGAAAGCTGGCAGGTCGA']
That says "give me all elements of lines, from the second (index 1) to the end.
Other general uses of slice notation:
s[i:j] slice of s from i to j
s[i:j:k] slice of s from i to j with step k (k can be negative to go backward)
Either i or j can be omitted (to imply the beginning or the end), and j can be negative to indicate a number of elements from the end.
s[:-1] All but the last element.
Edit in response to gnibbler's comment:
If the file is truly massive you can use iterator slicing to get the same effect while making sure you don't get the whole thing in memory.
import itertools
f = open("filename")
#start at the second line, don't stop, stride by one
for line in itertools.islice(f, 1, None, 1):
print line
"islicing" doesn't have the nice syntax or extra features of regular slicing, but it's a nice approach to remember.

Categories