Python reading file and analysing lines with substring - python

In Python, I'm reading a large file with many many lines. Each line contains a number and then a string such as:
[37273738] Hello world!
[83847273747] Hey my name is James!
And so on...
After reading the txt file into a list, how can I extract the number and then sort each whole line based on that number?
file = open("info.txt", "r")
myList = []
for line in file:
    line = line.split()
    myList.append(line)
What I would like to do:
Since the number in message one falls between 37273700 and 38000000, I'd like to sort that line (along with all other lines that follow that rule) into a separate list.

This does exactly what you need (for the sorting part), assuming each line has already been split into words as in your loop:
my_sorted_list = sorted(myList, key=lambda line: int(line[0][1:-1]))

Use a tuple as the key value, converting the number to int so that the sort is numeric rather than alphabetical:
for line in file:
    line = line.split()
    keyval = (int(line[0].replace('[', '').replace(']', '')), line[1:])
    print(keyval)
    myList.append(keyval)
Then sort:
my_sorted_list = sorted(myList, key=lambda line: line[0])
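One caveat when sorting on the bracketed number: if the key is left as a string, the comparison goes character by character, so "9" sorts after "83847273747". Here is a minimal sketch of the difference, using made-up sample lines (the `sample` list and the `number_key` helper are illustrative names, not from the question):

```python
sample = [
    "[83847273747] Hey my name is James!",
    "[9] short",
    "[37273738] Hello world!",
]

def number_key(line):
    # Text between the first "[" and the first "]", converted to int.
    return int(line[line.index("[") + 1:line.index("]")])

# String keys compare character by character.
as_text = sorted(sample, key=lambda line: line[1:line.index("]")])
# Integer keys compare by numeric value.
as_numbers = sorted(sample, key=number_key)

print(as_text)
print(as_numbers)
```

With string keys the "[9]" line ends up last; with integer keys it comes first.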

How about:
# ---
# Function which gets a number from a line like so:
# - searches for the pattern: start_of_line, [, sequence of digits
# - if that's not found (e.g. empty line) return 0
# - if it is found, try to convert it to a number type
# - return the number, or 0 if that conversion fails
def extract_number(line):
    import re
    # raw string, so the backslashes reach the regex engine unchanged
    search_result = re.findall(r'^\[(\d+)\]', line)
    if not search_result:
        num = 0
    else:
        try:
            num = int(search_result[0])
        except ValueError:
            num = 0
    return num
# ---
# Read all the lines into a list
with open("info.txt") as f:
    lines = f.readlines()
# Sort them using the number function above, and print them
lines = sorted(lines, key=extract_number)
print(''.join(lines))
It's more resilient to lines without numbers, and it's easier to adjust if the numbers might appear in different places (e.g. after spaces at the start of the line).
(Obligatory suggestion not to use file as a variable name because it's a builtin function name already, and that's confusing).
Now that there's an extract_number() function, it's easier to filter:
lines2 = [L for L in lines if 37273700 < extract_number(L) < 38000000]
print(''.join(lines2))
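Since the question mentions a large file, the same range filter can also be written as a generator so that only one line is held in memory at a time. This is just a sketch: `lines_in_range` and `NUMBER_AT_START` are hypothetical names, and `io.StringIO` stands in for the real `open("info.txt")` handle.

```python
import io
import re

# A bracketed run of digits at the very start of the line.
NUMBER_AT_START = re.compile(r"^\[(\d+)\]")

def lines_in_range(lines, low, high):
    """Yield lines whose leading [number] satisfies low < number < high."""
    for line in lines:
        match = NUMBER_AT_START.match(line)
        # Lines without a leading [number] are simply skipped.
        if match and low < int(match.group(1)) < high:
            yield line

# io.StringIO stands in for open("info.txt"); any iterable of lines works.
fake_file = io.StringIO(
    "[37273738] Hello world!\n"
    "[83847273747] Hey my name is James!\n"
    "\n"
)
selected = list(lines_in_range(fake_file, 37273700, 38000000))
print(selected)
```

Wrapping the call in `list(...)` is only for the demonstration; in a real run you could loop over the generator directly.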

Related

How do I split a txt file based on a condition of a certain element in a certain order list (Python)

So I need to split a txt file into a dictionary.
The txt file could look like this:
Keyone -2
key-two 1
Key'Three -3
Key four-here 5
I think I would need to check the line in reverse, to see whether the second-to-last element is either a " " or a "-", but since there could also be a "-" between the words in the string, I'm a bit confused about how to approach this.
I need the dict to look like [str(key); int(value)]
My attempt so far looks like this:
for line in file:
    a = line.split()
    value = a[-1]
    key = line[0:-2]
    key = key.replace("-", "")
Try the following code:
# Define input
txt = "Keyone -2\nkey-two 1\nKey'Three -3\nKey four-here 5"
print(txt)
# Split the text by newlines
lines = txt.split('\n')
print(lines)
# Iterate over all lines
d = {}
for line in lines:
    # Split once, at the last space:
    # the key is everything before it, the value is the element after it
    key, value = line.rsplit(' ', 1)
    # Assuming it can only be an integer
    value = int(value)
    d[key] = value
print(d)
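Alternatively, reading from a file:
d = {}
with open("text.txt") as f:
    for i in f:
        a = i.split()
        if not a:
            continue  # skip blank lines
        # the value is the last whitespace-separated token
        value = int(a[-1])
        # the key is everything before it
        key = " ".join(a[:-1])
        d[key] = value
print(d)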
The following is the answer using regex:
import re
data_to_parse = """
Keyone -2
key-two 1
Key'Three -3
Key four-here 5
"""
data_to_parse = data_to_parse.splitlines()
pattern = r" -?\d"
new = {}
for line in data_to_parse:
    if re.findall(pattern, line):
        x = re.findall(pattern, line)
        #print(line[line.find(x[0]) - 1:])
        new[line[:line.find(x[0])].strip()] = line[line.find(x[0]):].strip()
print(new)
This prints:
{'Keyone': '-2', 'key-two': '1', "Key'Three": '-3', 'Key four-here': '5'}
EDITED:
If the values need to be integers, change the assignment line as follows:
        new[line[:line.find(x[0])].strip()] = int(line[line.find(x[0]):].strip())
so that the values are stored as int instead of str.
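One limitation of searching for the first " -?\d" is that a key which itself contains a space followed by a digit would be split too early. A possible alternative, sketched here under the assumption that the value is always a (possibly negative) integer at the very end of the line, is to anchor the match on the whole line. `LINE_PATTERN` and `parse_pairs` are hypothetical names, and the "Item 2 x 10" line is made up to show the difference:

```python
import re

# Key: everything up to the last run of whitespace; value: a signed integer at the end.
LINE_PATTERN = re.compile(r"(.+?)\s+(-?\d+)\s*")

def parse_pairs(text):
    """Build {key: int(value)} from lines shaped like 'some key -3'."""
    result = {}
    for line in text.splitlines():
        # fullmatch forces the integer to be the last thing on the line.
        match = LINE_PATTERN.fullmatch(line)
        if match:  # blank or malformed lines are skipped
            result[match.group(1).strip()] = int(match.group(2))
    return result

parsed = parse_pairs(
    "Keyone -2\nkey-two 1\nKey'Three -3\nKey four-here 5\nItem 2 x 10"
)
print(parsed)
```

Because the whole line has to match, "Item 2 x 10" is read as key "Item 2 x" with value 10 rather than being cut at the first digit.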

How to read a value of file separate by tabs in Python?

I have a text file with this format
ConfigFile 1.1
;
; Version: 4.0.32.1
; Date="2021/04/08" Time="11:54:46" UTC="8"
;
Name
    John Legend
Type
    Student
Number
    s1054520
I would like to get the value of Name, Type, or Number.
How do I get it?
I tried this method, but it does not solve my problem:
import re
f = open("Data.txt", "r")
file = f.read()
Name = re.findall("Name", file)
print(Name)
My expected output is John Legend.
Can anyone help me, please? I'd really appreciate it. Thank you.
First of all, re.findall searches for all occurrences that match a given pattern. So in your case you are finding every "Name" in the file, because that is the pattern you passed.
On the other hand, the computer does not know that "John Legend" is the name; it only knows that it is the line after the word "Name".
In your case I would suggest the following steps:
1. Find the line number of "Name"
2. Read the next line
3. Get the name without the surrounding white space
If there is more than one Name, this will work as well.
The final code looks like this:
def search_string_in_file(file_name, string_to_search):
    """Search for the given string in file and return lines containing that string,
    along with line numbers"""
    line_number = 0
    list_of_results = []
    # Open the file in read only mode
    with open(file_name, 'r') as read_obj:
        # Read all lines in the file one by one
        for line in read_obj:
            # For each line, check if line contains the string
            line_number += 1
            if string_to_search in line:
                # If yes, then add the line number & line as a tuple in the list
                list_of_results.append((line_number, line.rstrip()))
    # Return list of tuples containing line numbers and lines where string is found
    return list_of_results
with open('Data.txt') as file:
    content = file.readlines()
matched_lines = search_string_in_file('Data.txt', 'Name')
print('Total Matched lines : ', len(matched_lines))
for i in matched_lines:
    # line numbers are 1-based, so content[i[0]] is the line after the match
    print(content[i[0]].strip())
Here I go through each line, and when I encounter Name I add the next line to the result list (you could also print it directly):
import re
def print_hi(name):
    result = []
    regexp = re.compile(r'Name')
    gotname = False
    with open('test.txt') as f:
        for line in f:
            if gotname:
                result.append(line.strip())
                gotname = False
            match = regexp.match(line)
            if match:
                gotname = True
    print(result)
if __name__ == '__main__':
    print_hi('test')
Assuming those label lines appear in the file in the same order as in the list, you can simply scan for them:
labelList = ["Name", "Type", "Number"]
captures = dict()
with open("Data.txt", "rt") as f:
    for label in labelList:
        while not f.readline().startswith(label):
            pass
        captures[label] = f.readline().strip()
for label in labelList:
    print(f"{label} : {captures[label]}")
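One thing to watch with the readline() scan above: at end of file readline() keeps returning an empty string, so a label that never appears would make the while loop spin forever. Below is a hedged sketch of a single-pass variant that tolerates missing labels and does not depend on their order. `capture_labels` is a hypothetical helper, and `io.StringIO` with tab-indented values stands in for Data.txt:

```python
import io

def capture_labels(lines, labels):
    """Return {label: following line} for each label found, scanning once."""
    captures = {}
    wanted = set(labels)
    iterator = iter(lines)
    for line in iterator:
        label = line.strip()
        if label in wanted:
            # next() with a default avoids StopIteration if a label is the last line.
            captures[label] = next(iterator, "").strip()
    return captures

# Stand-in for Data.txt (assumed layout: label line, then an indented value line).
sample = io.StringIO(
    "ConfigFile 1.1\n;\n; Version: 4.0.32.1\n;\n"
    "Name\n\tJohn Legend\nType\n\tStudent\nNumber\n\ts1054520\n"
)
result = capture_labels(sample, ["Name", "Type", "Number", "Missing"])
print(result)
```

A label that is not in the file (here "Missing") is just absent from the result, so the caller can decide how to handle it, for example with `result.get("Missing")`.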
I wouldn't use a regex, but rather write a parser for the file type. The rules might be:
1. The first line can be ignored.
2. Any line that starts with ; can be ignored.
3. Every line with no leading whitespace is a key.
4. Every line with leading whitespace is a value belonging to the last key.
I'd start with a generator that returns every line that is not ignored:
def read_data_lines(filename):
    with open(filename, "r") as f:
        # skip the first line
        f.readline()
        # read until no more lines
        while line := f.readline():
            # skip lines that start with ;
            if not line.startswith(";"):
                yield line
Then fill up a dict by following rules 3 and 4:
def parse_data_file(filename):
    data = {}
    key = None
    for line in read_data_lines(filename):
        # No starting whitespace makes this a key
        if not line[:1].isspace():
            key = line.strip()
        # Starting whitespace (space or tab) makes this a value for the last key
        else:
            data[key] = line.strip()
    return data
Now at this point you can parse the file and print whatever key you want:
data = parse_data_file("Data.txt")
print(data["Name"])

How do I fix this list index out of range error?

I'm trying to go through a txt file to pull out certain numbers, store them in a list, and then use the numbers to pull out strings stored in the same file. My code works on some of my files, but I'm suddenly getting a list index out of range error.
Here is an example of the portion of the text file I'm trying to extract:
                     /note="tRNA-Arg2"
     tRNA            5573494..5573567
                     /locus_tag="Tery_R0035"
                     /product="tRNA-Arg"
or
     tRNA            complement(5630800..5630872)
                     /locus_tag="Tery_R0036"
                     /product="tRNA-His"
I'm trying to get the numbers that are written after tRNA.
Here's my code to extract the numbers into a list:
def extract_numbers(line):
    #empty list
    numbers = []
    #creates a buffer (temporary space)
    digits = ""
    #for character in the line
    for c in line:
        #if its a digit
        if c.isdigit():
            #add character to the buffer
            digits += c
        #if it isnt a number
        else:
            #if there is something in the buffer (ie its not 0)
            if len(digits) > 0:
                #add the buffer to the numbers list
                numbers.append(digits)
                #empty again
                digits = ""
    #to make sure the last number is added to the list
    if len(digits) > 0:
        numbers.append(digits)
    return numbers
and this function applies it to every tRNA line in the file:
import io
def extract_tRNA(path):
    with io.open(path, mode="r", encoding="utf-8") as file:
        genome = file.readlines()
    start_stop = []
    for line in genome:
        if "tRNA" in line[0:21]:
            numbers = extract_numbers(line[21:])
            start_stop.append((int(numbers[0]), int(numbers[1])))
    return start_stop
Then I run it with this:
import glob
import os
work_dir = "/Users/..."
for path in glob.glob(os.path.join(work_dir, "*.gbff")):
    sequences = extract_seq(path)
    tRNA_loc = extract_tRNA(path)
    extract_genes(path, tRNA_loc, sequences)
    print(path)
Is it my file or code? I'm also not sure if there is an easier way to do the same thing?
Thanks for any help!
UPDATE, trying regex:
import re
work_dir = "where my files are"
for path in glob.glob(os.path.join(work_dir, "*.gbff")):
    with io.open(path, mode="r", encoding="utf-8") as file:
        genome = file.readlines()
    for line in genome:
        if "tRNA" in line[0:21]:
            p = re.compile(r'\d+') # \d means digit and + means one or more
            m = p.findall(line)
            print(m)
I'm assuming that you'd like a list of strings returned from your function extract_numbers.
Python supports regular expressions through its built-in re module (documentation).
Here is an example in which all of the strings of one or more digits are extracted.
import re
line = " tRNA 5573494..5573567"
p = re.compile(r'\d+') # \d means digit and + means one or more
m = p.findall(line)
m # returns ['5573494', '5573567']
Based on your description of what you want to achieve, this should work. Note that file.txt is the sample you included above:
import re
with open("file.txt") as f:
    data = f.readlines()
numberList = []
for line in data:
    dataList = line.split()  # words separated by whitespace, split into a list
    try:
        numberIndex = dataList.index("tRNA") + 1  # the numbers that are written after tRNA
        numberList.append(dataList[numberIndex])
    except (ValueError, IndexError):  # tRNA is not in the line, or nothing follows it
        continue
# The above cleans your data of all other numbers, e.g. "Tery_R0035"
# Regex taken from the top answer (@rajah9)
p = re.compile(r'\d+')  # \d means digit and + means one or more
for numData in numberList:
    m = p.findall(numData)
    print(m)
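As for the IndexError itself: one plausible cause, though it can't be confirmed without the failing file, is a line that has "tRNA" in its first 21 characters but fewer than two numbers after them, so that numbers[1] does not exist. The sketch below matches the start..stop pair explicitly and skips lines that don't have one. `extract_trna_locations` and `LOCATION` are hypothetical names, and the last sample line is made up to trigger the problem case:

```python
import re

# Two integers joined by "..", e.g. 5573494..5573567 (optional < or > partial markers).
LOCATION = re.compile(r"<?(\d+)\.\.>?(\d+)")

def extract_trna_locations(lines):
    """Return (start, stop) tuples for lines whose first 21 characters contain 'tRNA'."""
    locations = []
    for line in lines:
        if "tRNA" not in line[:21]:
            continue
        match = LOCATION.search(line[21:])
        if match is None:
            # No start..stop pair on this line: skip it instead of raising IndexError.
            continue
        locations.append((int(match.group(1)), int(match.group(2))))
    return locations

sample = [
    "     tRNA            5573494..5573567\n",
    '                     /locus_tag="Tery_R0035"\n',
    "     tRNA            complement(5630800..5630872)\n",
    "     tRNA            12345\n",  # made-up line with a single number
]
found = extract_trna_locations(sample)
print(found)
```

If you would rather find the offending lines than skip them, printing the line in the `match is None` branch shows exactly which records break the original code.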

how to create a list and sum up

I am relatively new to python and got stuck on the below:
Below is the code I am working with
import re
handle = open('RegExWeek2.txt')
for line in handle:
    line = line.rstrip()
    x = re.findall('[0-9]+', line)
    if len(x) > 0:
        print(x)
The return from this code looks like this:
['7430']
['9401', '9431']
['2248', '2047']
['5517']
['3184', '1241']
['9939']
['2185', '9450', '8428']
['369']
['3683', '6442', '7654']
Question: how do I combine this to one list and sum up the numbers?
Please help
You may change your code like this:
import re
handle = open('RegExWeek2.txt')
num = []
for line in handle:
    num.extend(re.findall('[0-9]+', line))
print(sum(int(i) for i in num))
Since you're using re.findall, the line.rstrip() call is not necessary.
Also, the result will never contain empty strings, since we use + after [0-9] (one or more repetitions of the previous token) rather than * (zero or more), so every item can safely be passed to int(). For a line without digits, re.findall simply returns an empty list, and extending num with it changes nothing.
There's no need to rstrip, and you should open files using with:
import re
all_numbers = []
with open('RegExWeek2.txt') as file:
    for line in file:
        numbers = re.findall('[0-9]+', line)
        for number in numbers:
            all_numbers.append(int(number))
print(sum(all_numbers))
This is really beginner code, and a direct translation of yours. Here's how I would write it:
with open('RegExWeek2.txt') as file:
    all_numbers = [int(num) for num in re.findall('[0-9]+', file.read())]
print(sum(all_numbers))
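If the file is too big to read in one go with file.read(), the same sum can be computed line by line with a generator expression, without building an intermediate list. A small sketch, where `sum_numbers` is a hypothetical helper and `io.StringIO` with made-up text stands in for the real file:

```python
import io
import re

def sum_numbers(lines):
    """Add up every run of digits found in an iterable of text lines."""
    return sum(
        int(token)
        for line in lines
        for token in re.findall(r"[0-9]+", line)
    )

# Stand-in for open('RegExWeek2.txt'); the text is invented for the example.
sample = io.StringIO("Why 7430 should\nno numbers here\n9401 and 9431\n")
total = sum_numbers(sample)
print(total)
```

Lines without digits contribute nothing, because re.findall returns an empty list for them.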

How can I use readline() to begin from the second line?

I'm writing a short program in Python that will read a FASTA file which is usually in this format:
>gi|253795547|ref|NC_012960.1| Candidatus Hodgkinia cicadicola Dsem chromosome, 52 lines
GACGGCTTGTTTGCGTGCGACGAGTTTAGGATTGCTCTTTTGCTAAGCTTGGGGGTTGCGCCCAAAGTGA
TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC
TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA
I've created another program that reads the first line (aka the header) of this FASTA file, and now I want this second program to start reading and printing from the sequence onward.
How would I do that?
So far I have this:
FASTA = open("test.txt", "r")
def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    for line in FASTA:
        line = line.strip()
        print(line)
readSeq(FASTA)
Thanks guys
-Noob
def readSeq(FASTA):
    """returns the DNA sequence of a FASTA file"""
    _unused = next(FASTA) # skip heading record
    for line in FASTA:
        line = line.strip()
        print(line)
Read the docs on file.next() to see why you should be wary of mixing file.readline() with for line in file:
You should show your script. To read from the second line, try something like this:
f = open("file")
f.readline()
for line in f:
    print(line)
f.close()
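A variation on skipping the header, sketched with the built-in next() and a default value so that an empty file does not raise StopIteration. `read_sequence` is a hypothetical name, the header text is made up, and `io.StringIO` stands in for the opened FASTA file (assumed to hold a single record):

```python
import io

def read_sequence(handle):
    """Return the sequence lines of a single-record FASTA handle, header skipped."""
    # Discard the header line; the None default avoids StopIteration on an empty file.
    next(handle, None)
    # Keep every remaining non-blank line, without its trailing newline.
    return [line.strip() for line in handle if line.strip()]

fasta = io.StringIO(">gi|example| made-up header\nGACGGCTTG\nTTAGATTTT\n")
sequence = read_sequence(fasta)
print("".join(sequence))
```

Returning the lines instead of printing them inside the function also matches the docstring in the question, which says the function "returns" the sequence.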
You might be interested in checking BioPythons handling of Fasta files (source).
def FastaIterator(handle, alphabet = single_letter_alphabet, title2ids = None):
    """Generator function to iterate over Fasta records (as SeqRecord objects).
    handle - input file
    alphabet - optional alphabet
    title2ids - A function that, when given the title of the FASTA
    file (without the beginning >), will return the id, name and
    description (in that order) for the record as a tuple of strings.
    If this is not given, then the entire title line will be used
    as the description, and the first word as the id and name.
    Note that use of title2ids matches that of Bio.Fasta.SequenceParser
    but the defaults are slightly different.
    """
    #Skip any text before the first record (e.g. blank lines, comments)
    while True:
        line = handle.readline()
        if line == "" : return #Premature end of file, or just empty?
        if line[0] == ">":
            break
    while True:
        if line[0]!=">":
            raise ValueError("Records in Fasta files should start with '>' character")
        if title2ids:
            id, name, descr = title2ids(line[1:].rstrip())
        else:
            descr = line[1:].rstrip()
            id = descr.split()[0]
            name = id
        lines = []
        line = handle.readline()
        while True:
            if not line : break
            if line[0] == ">": break
            #Remove trailing whitespace, and any internal spaces
            #(and any embedded \r which are possible in mangled files
            #when not opened in universal read lines mode)
            lines.append(line.rstrip().replace(" ","").replace("\r",""))
            line = handle.readline()
        #Return the record and then continue...
        yield SeqRecord(Seq("".join(lines), alphabet),
                        id = id, name = name, description = descr)
        if not line : return #StopIteration
    assert False, "Should not reach this line"
Good to see another bioinformatician :)
Just include an if clause within your for loop, above the line.strip() call:
def readSeq(FASTA):
    for line in FASTA:
        if line.startswith('>'):
            continue
        line = line.strip()
        print(line)
A pythonic and simple way to do this would be slice notation.
>>> f = open('filename')
>>> lines = f.readlines()
>>> lines[1:]
['GACGGCTTGTTTGCGTGCGACGAGTTTAGGATTGCTCTTTTGCTAAGCTTGGGGGTTGCGCCCAAAGTGA\n', 'TTAGATTTTCCGACAGCGTACGGCGCGCGCTGCTGAACGTGGCCACTGAGCTTACACCTCATTTCAGCGC\n', 'TCGCTTGCTGGCGAAGCTGGCAGCAGCTTGTTAATGCTAGTGTTGGGCTCGCCGAAAGCTGGCAGGTCGA']
That says "give me all elements of lines, from the second (index 1) to the end."
Other general uses of slice notation:
s[i:j] slice of s from i to j
s[i:j:k] slice of s from i to j with step k (k can be negative to go backward)
Either i or j can be omitted (to imply the beginning or the end), and j can be negative to indicate a number of elements from the end.
s[:-1] All but the last element.
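To make those slice forms concrete, here is a tiny sketch on an in-memory list (the list contents are made up; with a real file, `lines` would come from `f.readlines()`):

```python
lines = ["header", "line1", "line2", "line3"]

without_header = lines[1:]    # from index 1 to the end
all_but_last = lines[:-1]     # everything except the last element
every_other = lines[::2]      # step of 2, starting at index 0
reversed_lines = lines[::-1]  # negative step walks backward

print(without_header)
print(all_but_last)
print(every_other)
print(reversed_lines)
```

Each slice builds a new list, so `lines` itself is left unchanged.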
Edit in response to gnibbler's comment:
If the file is truly massive you can use iterator slicing to get the same effect while making sure you don't get the whole thing in memory.
import itertools
f = open("filename")
#start at the second line, don't stop, stride by one
for line in itertools.islice(f, 1, None, 1):
    print(line)
"islicing" doesn't have the nice syntax or extra features of regular slicing, but it's a nice approach to remember.
