How to read a value from a tab-separated file in Python? - python

I have a text file with this format:
ConfigFile 1.1
;
; Version: 4.0.32.1
; Date="2021/04/08" Time="11:54:46" UTC="8"
;
Name
John Legend
Type
Student
Number
s1054520
I would like to get the value of Name, Type, or Number.
How do I get it?
I tried this method, but it does not solve my problem:
import re
f = open("Data.txt", "r")
file = f.read()
Name = re.findall("Name", file)
print(Name)
My expected output is John Legend.
Can anyone help me, please? I would really appreciate it. Thank you.

First of all, re.findall searches for all occurrences that match a given pattern. In your case the pattern is "Name", so you find every "Name" in the file, because that is exactly what you asked for.
On the other hand, the computer does not know that "John Legend" is the name; it only knows that it is the line after the word "Name".
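To make that concrete, here is a small sketch of what re.findall("Name", ...) actually returns, next to a pattern that captures the following line instead. The sample text is abbreviated from the question, and the capturing pattern is just one possible choice, not the only way to do it:

```python
import re

# Abbreviated sample, assumed to match the layout shown in the question.
text = "Name\nJohn Legend\nType\nStudent\n"

# The pattern "Name" matches only the literal word, so that is all you get back.
found = re.findall("Name", text)
print(found)

# Capture the rest of the line that follows a line consisting of "Name".
match = re.search(r"^Name[ \t]*\n[ \t]*(.+)$", text, re.MULTILINE)
name = match.group(1).strip() if match else None
print(name)
```

The capture group is what makes the difference: without it, the regex can only hand back the text it matched, which is the label itself.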
In your case, I suggest the following approach:
1. Find the line number of "Name"
2. Read the next line
3. Get the name without the surrounding whitespace
If there is more than one Name, this will work as well.
The final code looks like this:
def search_string_in_file(file_name, string_to_search):
    """Search for the given string in file and return lines containing that string,
    along with line numbers"""
    line_number = 0
    list_of_results = []
    # Open the file in read only mode
    with open(file_name, 'r') as read_obj:
        # Read all lines in the file one by one
        for line in read_obj:
            # For each line, check if line contains the string
            line_number += 1
            if string_to_search in line:
                # If yes, then add the line number & line as a tuple in the list
                list_of_results.append((line_number, line.rstrip()))
    # Return list of tuples containing line numbers and lines where string is found
    return list_of_results
with open('Data.txt') as file:
    content = file.readlines()
matched_lines = search_string_in_file('Data.txt', 'Name')
print('Total Matched lines : ', len(matched_lines))
for i in matched_lines:
    # i[0] is the 1-based line number of the match, which is also the
    # 0-based index of the line that follows it
    print(content[i[0]].strip())

Here I go through each line, and when I encounter Name I add the next line to the result list (you could also print it directly):
import re
def print_hi(name):
    result = []
    # Match lines that start with the label "Name"
    regexp = re.compile(r'Name')
    gotname = False
    with open('test.txt') as f:
        for line in f:
            if gotname:
                # The previous line was the label, so this line is the value
                result.append(line.strip())
                gotname = False
            match = regexp.match(line)
            if match:
                gotname = True
    print(result)
if __name__ == '__main__':
    print_hi('test')

Assuming those label lines appear in the file in the same order as in the list, you can simply scan for them:
labelList = ["Name","Type","Number"]
captures = dict()
with open("Data.txt","rt") as f:
    for label in labelList:
        # Skip lines until the label is found; the next line is its value.
        # Note: this loops forever if a label is missing from the file.
        while not f.readline().startswith(label):
            pass
        captures[label] = f.readline().strip()
for label in labelList:
    print(f"{label} : {captures[label]}")

I wouldn't use a regex, but rather write a parser for the file type. The rules might be:
1. The first line can be ignored.
2. Any line that starts with ; can be ignored.
3. Every line with no leading whitespace is a key.
4. Every line with leading whitespace is a value belonging to the last key.
I'd start with a generator that yields every line that is not ignored:
def read_data_lines(filename):
    with open(filename, "r") as f:
        # skip the first line
        f.readline()
        # read until no more lines (the := operator needs Python 3.8+)
        while line := f.readline():
            # skip lines that start with ;
            if not line.startswith(";"):
                yield line
Then fill up a dict by following rules 3 and 4:
def parse_data_file(filename):
    data = {}
    key = None
    for line in read_data_lines(filename):
        # No starting whitespace (space or tab) makes this a key
        if not line.startswith((" ", "\t")):
            key = line.strip()
        # Starting whitespace makes this a value for the last key
        else:
            data[key] = line.strip()
    return data
Now at this point you can parse the file and print whatever key you want:
data = parse_data_file("Data.txt")
print(data["Name"])
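If the values in your file are not actually indented (the sample in the question shows no leading whitespace, though that may have been lost when it was posted), rule 4 cannot tell keys from values. A possible fallback, assuming that after the header and the comment lines the file strictly alternates between a key line and a value line, is to pair the lines up. The name parse_alternating_lines and the inline sample are made up for this sketch:

```python
def parse_alternating_lines(text):
    """Pair up key/value lines, assuming they strictly alternate."""
    lines = text.splitlines()[1:]  # skip the "ConfigFile 1.1" header line
    # Drop comment lines (starting with ";") and blank lines.
    lines = [ln.strip() for ln in lines if ln.strip() and not ln.startswith(";")]
    # Even positions are keys, odd positions are the matching values.
    return dict(zip(lines[0::2], lines[1::2]))


# Hypothetical sample mirroring the question's file, with tab-indented values.
sample = (
    "ConfigFile 1.1\n"
    ";\n"
    "; Version: 4.0.32.1\n"
    ";\n"
    "Name\n"
    "\tJohn Legend\n"
    "Type\n"
    "\tStudent\n"
    "Number\n"
    "\ts1054520\n"
)
data = parse_alternating_lines(sample)
print(data["Name"])
```

This is more fragile than the rule-based parser: a single missing value shifts every later pair, so it only makes sense when the alternation is guaranteed.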

Related

How can I edit several numbers/words in a txt file using Python?

I want to rewrite an existing file with content like:
Tom A
Mike B
Jim C
to
Tom 1
Mike 2
Jim 3
The letters A, B, C can also be something else. Basically, I want to keep the spaces between the names and whatever comes after them, but change what comes after into numbers. Does someone have an idea, please? Thanks a lot for your help.
I assume your first and second columns are separated by a tab (i.e. \t)?
If so, you can do this by reading the file into a list, use the split function to split each line of the file into components, edit the second component of each line, concatenate the two components back together with a tab separator and finally rewrite to a file.
For example, if test.txt is your input file:
# Create list that holds the desired output
output = [1,2,3]
# Open the file to be overwritten
with open('test.txt', 'r') as f:
    # Read file into a list of strings (one string per line)
    text = f.readlines()
# Open the file for writing (FYI this CLEARS the file as we specify 'w')
with open('test.txt', 'w') as f:
    # Loop over lines (i.e. elements) in `text`
    for i,item in enumerate(text):
        # Split line into elements based on whitespace (default for `split`)
        line = item.split()
        # Concatenate the name and desired output with a tab separator and write to the file
        f.write("%s\t%s\n" % (line[0],output[i]))
I assumed your first and second columns were separated by spaces in the file.
You can read the file contents into a list and call replace_end(line, newline) on each line; it replaces the end of the line with whatever you pass in. Then you can write the changed list back to the file.
""" rewrite an existing file """
def main():
    """ main """
    filename = "update_me.txt"
    count = 0
    lst = []
    with open(filename, "r",encoding = "utf-8") as filestream:
        _lines = filestream.readlines()
        for line in _lines:
            lst.insert(count,line.strip())
            count += 1
            #print(f"Line {count} {line.strip()}")
    count = 0
    # change the list
    for line in lst:
        lst[count] = replace_end(line,"ABC")
        count +=1
    count = 0
    with open(filename, "w", encoding = "utf-8") as filestream:
        for line in lst:
            filestream.write(line+"\n")
            count +=1
def replace_end(line,newline):
    """ replace the last len(newline) characters of a line """
    return line[:-len(newline)] + newline
if __name__ == '__main__':
    main()
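Neither answer leaves the original run of whitespace untouched, which is what the question asks for. If that matters, one option is a regular expression that swaps only the last field of each line. This is a sketch under the assumption that each line is a name, some whitespace, and one final field, with the trailing newline already stripped; replace_last_field is a name invented here:

```python
import re


def replace_last_field(line, new_value):
    """Replace the last whitespace-delimited field, keeping the gap before it."""
    # (\s+) keeps the original spaces/tabs; \S+ is the field being replaced.
    return re.sub(r"(\s+)\S+\s*$", lambda m: m.group(1) + str(new_value), line)


# Made-up lines with different separators to show the gap is preserved.
lines = ["Tom   A", "Mike\tB", "Jim C"]
updated = [replace_last_field(line, n) for n, line in enumerate(lines, start=1)]
print(updated)
```

A line with only one field has nothing to replace and comes back unchanged.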

How do I read a file line by line and print only the lines that contain a specific string in Python?

I have a text file containing these lines
wbwubddwo 7::a number1 234 **
/// 45daa;: number2 12
time 3:44
I am trying to print, for example, 234 if the program finds the string number1.
I started with the simple script below, but it did not print what I wanted.
with open("test.txt", "rb") as f:
    lines = f.read()
    word = ["number1", "number2", "time"]
    if any(item in lines for item in word):
        val1 = lines.split("number1 ", 1)[1]
        print val1
This returns the following result:
234 **
/// 45daa;: number2 12
time 3:44
Then I tried changing f.read() to f.readlines(), but this time it did not print anything.
Does anyone know another way to do this? Eventually I want to get the value from each line (for example 234, 12 and 3:44) and store it in a database.
Thank you for your help. I really appreciate it.
Explanations are given below the code:
with open("test.txt", "r") as f:
    lines = f.readlines()
    stripped_lines = [line.strip() for line in lines]
words = ["number1", "number2", "time"]
for a_line in stripped_lines:
    for word in words:
        if word in a_line:
            number = a_line.split()[1]
            print(number)
1) First of all, 'rb' gives a bytes object, i.e. something like b'number1 234' would be returned. Use 'r' to get a string object.
2) The lines you read are stored in a list and look something like this:
['number1 234\r\n', 'number2 12\r\n', '\r\n', 'time 3:44']
Notice the \r\n; those mark a newline. To remove them, use strip().
3) Take each line from stripped_lines and each word from words, and check whether that word is present in that line using in.
4) a_line would be number1 234, but we only want the number part, so call split(). The output of that would be ['number1','234'], and split()[1] is the element at index 1 (the 2nd element).
5) You can also check whether the string consists only of digits using your_string.isdigit().
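As a quick illustration of point 5, here is how str.isdigit() behaves on the kinds of tokens that appear in the sample lines (the token list is picked by hand from the question's data):

```python
# Candidate tokens taken from the sample lines in the question.
tokens = ["234", "12", "3:44", "**"]

for token in tokens:
    # str.isdigit() is True only when every character is a digit,
    # so "3:44" and "**" are rejected.
    print(token, token.isdigit())

digits_only = [t for t in tokens if t.isdigit()]
print(digits_only)
```

This is why a time such as 3:44 needs a separate check; isdigit() alone would discard it.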
UPDATE: Since you updated your question and input file, this works:
import time
def isTimeFormat(input):
    try:
        time.strptime(input, '%H:%M')
        return True
    except ValueError:
        return False
with open("test.txt", "r") as f:
    lines = f.readlines()
    stripped_lines = [line.strip() for line in lines]
words = ["number1", "number2", "time"]
for a_line in stripped_lines:
    for word in words:
        if word in a_line:
            number = a_line.split()[-1] if (a_line.split()[-1].isdigit() or isTimeFormat(a_line.split()[-1])) else a_line.split()[-2]
            print(number)
Why this isTimeFormat() function?
def isTimeFormat(input):
    try:
        time.strptime(input, '%H:%M')
        return True
    except ValueError:
        return False
It checks whether values such as 3:44 or 4:55 are in a time format, since you are treating them as values too.
Final output:
234
12
3:44
After some trial and error, I found the solution below. It is based on the answer provided by @s_vishnu:
with open("test.txt", "r") as f:
    lines = f.readlines()
    stripped_lines = [line.strip() for line in lines]
for item in stripped_lines:
    # Split on the label that was matched, then take the token right after it
    if "number1" in item:
        getval = item.split("number1 ")[1].split(" ")[0]
        print(getval)
    if "number2" in item:
        getval2 = item.split("number2 ")[1].split(" ")[0]
        print(getval2)
    if "time" in item:
        getval3 = item.split("time ")[1].split(" ")[0]
        print(getval3)
Output:
234
12
3:44
This way, I can also do other things for example saving each data to a database.
I am open to any suggestion to further improve my answer.
You're overthinking this. Assuming you don't have those two asterisks at the end of the first line, and you want to print the value from every line that contains one of the chosen strings, you can just read the file line by line, check whether any of the chosen strings occurs in the line, and print the last value (the text between the last space and the end of the line). There is no need to parse or split the whole line at all:
search_values = ["number1", "number2", "time"]  # values to search for
with open("test.txt", "r") as f:  # open your file
    for line in f:  # read it line by line
        if any(value in line for value in search_values):  # check for search_values in line
            print(line[line.rfind(" ") + 1:].rstrip())  # print the last value after space
Which will give you:
234
12
3:44
If you do have asterisks, you have to define your file format more precisely, as splitting won't necessarily yield the value you want.
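One way to pin the format down, assuming the value is always the token directly after the label (that is an assumption about the file, not something the question confirms), is a regular expression that captures that token and ignores anything after it:

```python
import re

# Labels from the question; the value is assumed to be the next token.
pattern = re.compile(r"\b(number1|number2|time)\s+(\S+)")

# The sample lines from the question, including the trailing asterisks.
sample_lines = [
    "wbwubddwo 7::a number1 234 **",
    "/// 45daa;: number2 12",
    "time 3:44",
]

values = {}
for line in sample_lines:
    match = pattern.search(line)
    if match:
        label, value = match.groups()
        values[label] = value

print(values)
```

Collecting the results in a dict keyed by label also makes the later step of writing them to a database a bit easier.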

Reading a file and storing contents into a dictionary - Python

I'm trying to store contents of a file into a dictionary and I want to return a value when I call its key. Each line of the file has two items (acronyms and corresponding phrases) that are separated by commas, and there are 585 lines. I want to store the acronyms on the left of the comma to the key, and the phrases on the right of the comma to the value. Here's what I have:
def read_file(filename):
    infile = open(filename, 'r')
    for line in infile:
        line = line.strip() #remove newline character at end of each line
        phrase = line.split(',')
        newDict = {'phrase[0]':'phrase[1]'}
    infile.close()
And here's what I get when I try to look up the values:
>>> read_file('acronyms.csv')
>>> acronyms=read_file('acronyms.csv')
>>> acronyms['ABT']
Traceback (most recent call last):
File "<pyshell#65>", line 1, in <module>
acronyms['ABT']
TypeError: 'NoneType' object is not subscriptable
>>>
If I add return newDict to the end of the body of the function, it obviously just returns {'phrase[0]':'phrase[1]'} when I call read_file('acronyms.csv'). I've also tried {phrase[0]:phrase[1]} (no single quotation marks) but that returns the same error. Thanks for any help.
def read_acronym_meanings(path:str):
    with open(path) as f:
        # split at the first comma only, so phrases may contain commas
        acronyms = dict(l.strip().split(',', 1) for l in f)
    return acronyms
First off, you are creating a new dictionary at every iteration of the loop. Instead, create one dictionary and add an element each time you go over a line. Second, 'phrase[0]' includes the apostrophes, which make it a string literal instead of a reference to the phrase variable that you just created.
Also, try using the with keyword so that you don't have to explicitly close the file later.
def read(filename):
    newDict = {}
    with open(filename, 'r') as infile:
        for line in infile:
            line = line.strip() #remove newline character at end of each line
            phrase = line.split(',')
            newDict[phrase[0]] = phrase[1]
    return newDict
def read_file(filename):
    infile = open(filename, 'r')
    newDict = {}
    for line in infile:
        line = line.strip() #remove newline character at end of each line
        phrase = line.split(',', 1) # split max of one time
        newDict.update( {phrase[0]:phrase[1]})
    infile.close()
    return newDict
Your original creates a new dictionary every iteration of the loop.
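Since the file is a .csv, the standard csv module is another option; it also copes with phrases that contain commas as long as they are quoted. A sketch, assuming two columns per row (the sample rows and the name read_acronyms are made up for illustration):

```python
import csv
import io


def read_acronyms(lines):
    """Build {acronym: phrase} from an iterable of CSV lines."""
    acronyms = {}
    for row in csv.reader(lines):
        if len(row) < 2:
            continue  # skip blank or malformed rows
        acronyms[row[0].strip()] = row[1].strip()
    return acronyms


# Hypothetical sample data standing in for acronyms.csv.
sample = io.StringIO('ABT,about\nBTW,"by the way, friend"\n')
acronyms = read_acronyms(sample)
print(acronyms["ABT"])
```

With a real file you would pass the open file object instead of the StringIO, for example with open('acronyms.csv', newline='') as f: acronyms = read_acronyms(f).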

Extracting data from a file using regular expressions and storing in a list to be compiled into a dictionary- python

I've been trying to extract both the species name and the sequence from a file like the one shown below, in order to build a dictionary with the key corresponding to the species name (FOXP2_MOUSE, for example) and the value corresponding to the amino acid sequence.
Sample fasta file:
>sp|P58463|FOXP2_MOUSE
MMQESATETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELL
HLQQQQALQAARQLLLQQQTSGLKSPKSSEKQRPLQVPVSVAMMTPQVIT
PQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQL
LQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ-HPGKQAKE
QQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQA
ALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSST
TSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK
>sp|Q8MJ98|FOXP2_PONPY
MMQESVTETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELL
HLQQQQALQAARQLLLQQQTSGLKSPKSSDKQRPLQVPVSVAMMTPQVIT
PQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQL
LQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ--HPGKQAKE
QQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQA
ALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSST
TSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK
I've tried using my code below:
import re
InFileName = "foxp2.fasta"
InFile = open(InFileName, 'r')
Species = []
Sequence = []
reg = re.compile('FOXP2_\w+')
for Line in InFile:
    Species += reg.findall(Line)
print Species
reg = re.compile('(^\w+)')
for Line in InFile:
    Sequence += reg.findall(Line)
print Sequence
dictionary = dict(zip(Species, Sequence))
InFile.close()
However, the output for my lists is:
[FOXP2_MOUSE, FOXP2_PONPY]
[]
Why is my second list empty? Are you not allowed to use re.compile() twice? Any suggestions on how to circumvent my problem?
Thank you,
Christy
If you want to read a file twice, you have to seek back to the beginning.
InFile.seek(0)
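The reason is that a file object is an iterator with a position: once the first loop reaches the end, a second loop has nothing left to read, which has nothing to do with calling re.compile() twice. A small demonstration, using a temporary file with made-up contents:

```python
import os
import tempfile

# Write a tiny two-line file to experiment with.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("first\nsecond\n")
    path = tmp.name

with open(path) as infile:
    first_pass = [line.strip() for line in infile]
    second_pass = [line.strip() for line in infile]  # already at end of file
    infile.seek(0)                                    # rewind to the start
    third_pass = [line.strip() for line in infile]

os.remove(path)
print(first_pass, second_pass, third_pass)
```

The second pass is empty, and the third pass sees the same lines as the first only because of the seek(0) in between.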
You can do it in a single pass, and without regular expressions:
def load_fasta(filename):
    data = {}
    species = ""
    sequence = []
    with open(filename) as inf:
        for line in inf:
            line = line.strip()
            if line.startswith(";"): # is comment?
                # skip it
                pass
            elif line.startswith(">"): # start of new record?
                # save previous record (if any)
                if species and sequence:
                    data[species] = "".join(sequence)
                species = line.split("|")[2]
                sequence = []
            else: # continuation of previous record
                sequence.append(line)
    # end of file - finish storing last record
    if species and sequence:
        data[species] = "".join(sequence)
    return data
data = load_fasta("foxp2.fasta")
On your given file, this produces data ==
{
'FOXP2_PONPY': 'MMQESVTETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELLHLQQQQALQAARQLLLQQQTSGLKSPKSSDKQRPLQVPVSVAMMTPQVITPQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQLLQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ--HPGKQAKEQQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQAALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSSTTSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK',
'FOXP2_MOUSE': 'MMQESATETISNSSMNQNGMSTLSSQLDAGSRDGRSSGDTSSEVSTVELLHLQQQQALQAARQLLLQQQTSGLKSPKSSEKQRPLQVPVSVAMMTPQVITPQQMQQILQQQVLSPQQLQALLQQQQAVMLQQQQLQEFYKKQQEQLHLQLLQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ-HPGKQAKEQQQQQQQQQ-LAAQQLVFQQQLLQMQQLQQQQHLLSLQRQGLISIPPGQAALPVQSLPQAGLSPAEIQQLWKEVTGVHSMEDNGIKHGGLDLTTNNSSSTTSSTTSKASPPITHHSIVNGQSSVLNARRDSSSHEETGASHTLYGHGVCK'
}
You could also do this in a single pass with a multiline regex:
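import re
# Raw string so that \w and \n reach the regex engine unchanged
reg = re.compile(r'(FOXP2_\w+)\n(^[\w\n-]+)', re.MULTILINE)
with open("foxp2.fasta", 'r') as file:
    # Note: each captured sequence still contains its newline characters
    data = dict(reg.findall(file.read()))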
The downside is that you have to read the whole file in at once. Whether this is a problem depends on likely file sizes.

How can I use readline() to begin from the second line?

I'm writing a short program in Python that will read a FASTA file which is usually in this format:
>gi|253795547|ref|NC_012960.1| Candidatus Hodgkinia cicadicola Dsem chromosome, 52 lines
GACGGCTTGTTTGCGTGCGACGAGTTTAGGATTGCTCTTTTGCTAAGCTTGGGGGTTGCGCCCAAAGTGA
TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC
TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA
I've created another program that reads the first line (aka the header) of this FASTA file, and now I want this second program to start reading and printing from the sequence onward.
How would I do that?
So far I have this:
FASTA = open("test.txt", "r")
def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    for line in FASTA:
        line = line.strip()
        print line
readSeq(FASTA)
Thanks guys
-Noob
def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    _unused = next(FASTA) # skip heading record (FASTA.next() in Python 2 only)
    for line in FASTA:
        line = line.strip()
        print(line)
Read the Python 2 docs on file.next() to see why you should be wary of mixing file.readline() with for line in file.
You should show your script. To read from the second line, try something like this:
f=open("file")
f.readline()
for line in f:
    print(line)
f.close()
You might be interested in checking Biopython's handling of FASTA files (source).
def FastaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
    """Generator function to iterate over Fasta records (as SeqRecord objects).
    handle - input file
    alphabet - optional alphabet
    title2ids - A function that, when given the title of the FASTA
    file (without the beginning >), will return the id, name and
    description (in that order) for the record as a tuple of strings.
    If this is not given, then the entire title line will be used
    as the description, and the first word as the id and name.
    Note that use of title2ids matches that of Bio.Fasta.SequenceParser
    but the defaults are slightly different.
    """
    #Skip any text before the first record (e.g. blank lines, comments)
    while True:
        line = handle.readline()
        if line == "" : return #Premature end of file, or just empty?
        if line[0] == ">":
            break
    while True:
        if line[0]!=">":
            raise ValueError("Records in Fasta files should start with '>' character")
        if title2ids:
            id, name, descr = title2ids(line[1:].rstrip())
        else:
            descr = line[1:].rstrip()
            id = descr.split()[0]
            name = id
        lines = []
        line = handle.readline()
        while True:
            if not line : break
            if line[0] == ">": break
            #Remove trailing whitespace, and any internal spaces
            #(and any embedded \r which are possible in mangled files
            #when not opened in universal read lines mode)
            lines.append(line.rstrip().replace(" ","").replace("\r",""))
            line = handle.readline()
        #Return the record and then continue...
        yield SeqRecord(Seq("".join(lines), alphabet),
                        id = id, name = name, description = descr)
        if not line : return #StopIteration
    assert False, "Should not reach this line"
Good to see another bioinformatician :)
Just include an if clause within your for loop, above the line.strip() call:
def readSeq(FASTA):
    for line in FASTA:
        if line.startswith('>'):
            continue
        line = line.strip()
        print(line)
A pythonic and simple way to do this would be slice notation.
>>> f = open('filename')
>>> lines = f.readlines()
>>> lines[1:]
['TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC\n', 'TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA']
That says "give me all elements of lines, from the second (index 1) to the end."
Other general uses of slice notation:
s[i:j] slice of s from i to j
s[i:j:k] slice of s from i to j with step k (k can be negative to go backward)
Either i or j can be omitted (to imply the beginning or the end), and either can be negative to count from the end.
s[:-1] All but the last element.
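A few of these forms in action, on a made-up list standing in for the lines of a FASTA file:

```python
# Hypothetical stand-in for the result of f.readlines().
lines = ["header", "seq1", "seq2", "seq3"]

print(lines[1:])    # everything from index 1 onward
print(lines[1:3])   # index 1 up to, but not including, index 3
print(lines[::2])   # every second element
print(lines[::-1])  # a reversed copy
print(lines[:-1])   # all but the last element
```

Every slice returns a new list, so the original lines list is left as it was.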
Edit in response to gnibbler's comment:
If the file is truly massive you can use iterator slicing to get the same effect while making sure you don't get the whole thing in memory.
import itertools
f = open("filename")
#start at the second line, don't stop, stride by one
for line in itertools.islice(f, 1, None, 1):
    print(line)
"islicing" doesn't have the nice syntax or extra features of regular slicing, but it's a nice approach to remember.
