Compiling lines from a file that are separated by a certain element in Python

File:
>1
ATTTTttttGGGG
ccCgCgGAgggGGT
gggggttttTTTTTTTTT
>2
ATcggGGGGGGA
>3
ATCGGGGGGATTT
gggggttAGTAttt
I'm constructing a function that reads files in this format.
The format has multiple records embedded in it, each introduced by '>' plus a name (e.g. '>1', '>2').
I'm trying to take the lines of text flanked by the '>' lines and compile them into one string per section, so the result would look like:
name_list = ['>1','>2','>3']
sequence_list = ['ATTTTttttGGGGccCgCgGAgggGGTgggggttttTTTTTTTTT','ATcggGGGGGGA','ATCGGGGGGATTTgggggttAGTAttt']
import os
import re
# Open the input and output files (FASTA and output are defined elsewhere)
in_file = open(FASTA, 'r')
dir, file = os.path.split(FASTA)
temp = os.path.join(dir, output)
out_file = open(temp, 'w')
# Generating lines
lines = []
name_list = []
seq_list = []
for line in in_file:
    line = line.strip()
    lines.append(line)
in_file.close()
indx = range(0, len(lines))
# Organizing the elements
for line in lines:
    for i in line:
        if i == '>':
            name_list.append(line)
        else:
            break
I don't know what to do for the else: statement.
I tried creating an index with range(0, len(lines)), so maybe I could do something where it finds '>' and compiles all the lines at the following indices until it finds the next '>', then adds them to the list called seq_list.
Any help would be greatly appreciated.
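For reference, a minimal sketch of that scan-and-collect idea (building on the lines, name_list and seq_list variables above; no index bookkeeping is actually needed):
current = []
for line in lines:
    if line.startswith('>'):  # header line: start a new record
        if current:
            seq_list.append(''.join(current))
            current = []
        name_list.append(line)
    else:  # sequence line: keep collecting
        current.append(line)
if current:  # flush the final record
    seq_list.append(''.join(current))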

You should take a look at Biopython, which has a FASTA parser, but here's an example using the standard library:
import re
with open('filename') as f:
    print [i.replace('\n','') for i in re.split(r'\>\d+', f.read()) if i]
out:
['ATTTTttttGGGGccCgCgGAgggGGTgggggttttTTTTTTTTT',
'ATcggGGGGGGA',
'ATCGGGGGGATTTgggggttAGTAttt']
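If the names are needed as well, a variant of the same idea (my sketch, not part of the original answer) keeps the headers by using a capturing group in the split pattern:
import re
with open('filename') as f:
    parts = re.split(r'(>\d+)', f.read())  # the capturing group keeps the '>' headers
name_list = [p for p in parts if p.startswith('>')]
sequence_list = [p.replace('\n', '') for p in parts if p and not p.startswith('>')]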
Using Biopython [sudo pip install biopython]:
from Bio import SeqIO
with open("example.fasta", "rU") as handle:
    print list(SeqIO.parse(handle, "fasta"))
out:
[SeqRecord(seq=Seq('ATTTTttttGGGGccCgCgGAgggGGTgggggttttTTTTTTTTT', SingleLetterAlphabet()), id='1', name='1', description='1', dbxrefs=[]),
SeqRecord(seq=Seq('ATcggGGGGGGA', SingleLetterAlphabet()), id='2', name='2', description='2', dbxrefs=[]),
SeqRecord(seq=Seq('ATCGGGGGGATTTgggggttAGTAttt', SingleLetterAlphabet()), id='3', name='3', description='3', dbxrefs=[])]
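To turn those records into the two lists the question asks for, something like this should work (str(record.seq) and record.id are standard SeqRecord attributes):
from Bio import SeqIO
with open("example.fasta", "rU") as handle:
    records = list(SeqIO.parse(handle, "fasta"))
name_list = ['>' + r.id for r in records]
sequence_list = [str(r.seq) for r in records]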

A dictionary would make life easier:
>>> d = {}
>>> with open('t.txt') as f:
...     for line in f:
...         if line.startswith('>'):
...             key = line.strip()
...             if key not in d:
...                 d[key] = []
...         else:
...             d[key].append(line.strip())
...
>>> d
{'>1': ['ATTTTttttGGGG', 'ccCgCgGAgggGGT', 'gggggttttTTTTTTTTT'],
'>2': ['ATcggGGGGGGA'], '>3': ['ATCGGGGGGATTT', 'gggggttAGTAttt']}
>>> sequence_list = [''.join(k) for k in d.values()]
>>> sequence_list
['ATTTTttttGGGGccCgCgGAgggGGTgggggttttTTTTTTTTT',
'ATcggGGGGGGA', 'ATCGGGGGGATTTgggggttAGTAttt']
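One caveat (my note, not from the answer): before Python 3.7 a plain dict does not guarantee insertion order, so d.values() may not line up with the order of the headers in the file. Filling a collections.OrderedDict with exactly the same loop keeps the file order:
>>> from collections import OrderedDict
>>> d = OrderedDict()
>>> # ... fill d with the same loop as above ...
>>> sequence_list = [''.join(v) for v in d.values()]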

Related

Find frequency of words line by line in txt file Python (how to format properly)

I'm trying to make a simple program that counts how often each line occurs in a text file. I have it outputting everything correctly except when more than one word is on a line in the text file. (More information below.)
The text file looks like this:
Hello
Hi
Hello
Good Day
Hi
Good Day
Good Night
I want the output to be: (Doesn't have to be in the same order)
Hello: 2
Hi: 2
Good Day: 2
Good Night: 2
What it's currently outputting:
Day: 2
Good: 3
Hello: 2
Hi: 2
Night: 1
My code:
file = open("test.txt", "r")
text = file.read()  # reads the file (I've tried .readline() & .readlines() too)
word_list = text.split(None)
word_freq = {}  # declares empty dictionary
for word in word_list:
    word_freq[word] = word_freq.get(word, 0) + 1
keys = sorted(word_freq.keys())
for word in keys:
    final = word.capitalize()
    print(final + ': ' + str(word_freq[word]))  # line that prints the output
You want to preserve the lines: don't split, don't capitalize, don't sort. Use a Counter:
from collections import Counter
c = Counter()
with open('test.txt') as f:
    for line in f:
        c[line.rstrip()] += 1
for k, v in c.items():
    print('{}: {}'.format(k, v))
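If output sorted by frequency is ever wanted, Counter provides it directly via its most_common() method:
for k, v in c.most_common():
    print('{}: {}'.format(k, v))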
Instead of splitting the text by None, split it by each line break so you get each line into a list.
file = open("test.txt", "r")
text = file.read()  # reads the file (I've tried .readline() & .readlines() too)
word_list = text.split('\n')
word_freq = {}  # declares empty dictionary
for word in word_list:
    word_freq[word] = word_freq.get(word, 0) + 1
keys = sorted(word_freq.keys())
for word in keys:
    final = word.capitalize()
    print(final + ': ' + str(word_freq[word]))  # line that prints the output
You can make life very easy for yourself by using a Counter object. If you want to count the occurrences of full lines you can simply do:
from collections import Counter
with open('file.txt') as f:
    c = Counter(f)
print(c)
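One subtlety worth flagging (my note): counting the file handle directly means the keys keep their trailing newlines, and a final line without one would be counted separately, so stripping first is safer:
from collections import Counter
with open('file.txt') as f:
    c = Counter(line.rstrip('\n') for line in f)
print(c)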
Edit
Since you asked for a way without modules:
counter_dict = {}
with open('file.txt') as f:
    l = f.readlines()
for line in l:
    if line not in counter_dict:
        counter_dict[line] = 0
    counter_dict[line] += 1
print(counter_dict)
Thank you all for the answers; most of the code produces the desired output, just in different ways. The code I ended up using with no modules was this:
file = open("test.txt", "r")
text = file.read()  # reads the file (I've tried .readline() & .readlines() too)
word_list = text.split('\n')
word_freq = {}  # declares empty dictionary
for word in word_list:
    word_freq[word] = word_freq.get(word, 0) + 1
keys = sorted(word_freq.keys())
for word in keys:
    final = word.capitalize()
    print(final + ': ' + str(word_freq[word]))  # line that prints the output
The code I ended up using with modules was this:
from collections import Counter
c = Counter()
with open('live.txt') as f:
    for line in f:
        c[line.rstrip()] += 1
for k, v in c.items():
    print('{}: {}'.format(k, v))

Fastest way to convert files into lists?

I have a .txt file which contains some words:
e.g.
bye
bicycle
bi
cyc
le
and I want to return a list which contains all the words in the file. I have tried some code which actually works, but I think it takes a lot of time to execute for bigger files. Is there a way to make this code more efficient?
lst1 = []
with open('file.txt', 'r') as f:
    for line in f:
        if line == '\n':  # blank line
            lst1.append(line)
        else:
            lst1.append(line.replace('\n', ''))  # the way I find more efficient to concatenate the letters of a word
str1 = ''.join(lst1)
lst_fin = str1.split()
expected output:
lst_fin = ['bye', 'bicycle', 'bicycle']
I don't know if this is more efficient, but at least it's an alternative... :)
with open('file.txt') as f:
    words = f.read().replace('\n\n', '|').replace('\n', '').split('|')
print(words)
...or if you don't want to insert a character like '|' (which could already be there) into the data, you could also do:
with open('file.txt') as f:
    words = f.read().split('\n\n')
words = [w.replace('\n', '') for w in words]
print(words)
result is the same in both cases:
# ['bye', 'bicycle', 'bicycle']
EDIT:
I think I have another approach. However, it requires the file not to start with a blank line, iiuc...
with open('file.txt') as f:
    res = []
    current_elmnt = next(f).strip()
    for line in f:
        if line.strip():
            current_elmnt += line.strip()
        else:
            res.append(current_elmnt)
            current_elmnt = ''
    if current_elmnt:  # flush the last element when the file doesn't end with a blank line
        res.append(current_elmnt)
print(res)
Perhaps you want to give it a try...
You can use the iter function with a sentinel of '' instead:
with open('file.txt') as f:
    lst_fin = list(iter(lambda: ''.join(iter(map(str.strip, f).__next__, '')), ''))
Demo: https://repl.it/#blhsing/TalkativeCostlyUpgrades
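Unpacked, that one-liner uses iter's callable-with-sentinel form at two levels: the inner iter collects stripped lines until a blank line (which strips to ''), and the outer one stops once a whole group comes back empty at end of file. A roughly equivalent long-hand version (my reading of it, not the answerer's wording) would be:
with open('file.txt') as f:
    lst_fin, word = [], []
    for line in map(str.strip, f):  # strip whitespace/newlines lazily
        if line == '':  # blank line ends the current word
            lst_fin.append(''.join(word))
            word = []
        else:
            word.append(line)
    if word:  # flush the final word at end of file
        lst_fin.append(''.join(word))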
You could use this (I don't know about its efficiency):
lst = []
s = ''
with open('tp.txt', 'r') as file:
    l = file.readlines()
    for i in l:
        if i == '\n':
            lst.append(s)
            s = ''
        elif i == l[-1]:  # note: compares by value, so a duplicate of the last line triggers this early
            s += i.rstrip()
            lst.append(s)
        else:
            s += i.rstrip()
print(lst)

Indexing lines in a Python file

I want to open a file, and simply return the contents of said file with each line beginning with the line number.
So hypothetically if the contents of a is
a
b
c
I would like the result to be
1: a
2: b
3: c
I'm kind of stuck; I tried enumerate but it doesn't give me the desired format.
It's for uni, but only a practice test.
A couple of bits of trial code, to prove I have no idea what I'm doing / where to start:
def print_numbered_lines(filename):
    """returns the infile data with a line number in front of the contents"""
    in_file = open(filename, 'r').readlines()
    list_1 = []
    for line in in_file:
        for item in line:
            item.index(item)
            list_1.append(item)
    return list_1
def print_numbered_lines(filename):
    """returns the infile data with a line number in front of the contents"""
    in_file = open(filename, 'r').readlines()
    result = []
    for i in in_file:
        result.append(enumerate(i))
    return result
A file handle can be treated as an iterable.
with open('tree_game2.txt') as f:
    for i, line in enumerate(f):
        print("{0}: {1}".format(i + 1, line.rstrip('\n')))  # rstrip avoids a doubled newline from print
There seems to be no need to write a Python script; awk would solve your problem:
awk '{print NR": "$0}' your_file > new_file
($0 is the whole line; $1 would print only the first field.)
What about using an OrderedDict:
from collections import OrderedDict
c = OrderedDict()
n = 1
with open('file.txt', 'r') as f:
    for line in f:
        c.update({n: line})
        # if you just want to print it, skip the dict part and just do:
        print n, line
        n += 1
Then you can print it out with:
for n, line in c.iteritems():  # .items() if Python 3
    print n, line
The simple way to do it: first open the file, then use a count mechanism. For example:
with open("file.txt") as f:
    data = f.read()
lines = data.split("\n")
count = 1
for line in lines:
    print("line " + str(count) + "> " + line)
    count += 1

Load words from files and make a list of them

My idea is to load words from a directory (containing A Words.txt to Z Words.txt) and copy them into a list. The code below works, but adds "\n" at the end of each word (example: ["apple\n", "abort\n"]); can anybody suggest a way to fix it?
from io import *
import string
def load_words(base_dir):
    words = []
    for i in string.uppercase:
        location = base_dir + "\\" + i + " Words.txt"
        with open(location, "rb+") as f:
            words += f.readlines()
    return words
change
words += f.readlines()
to :
words += [x.strip() for x in f.readlines()]
strip() removes trailing and leading whitespace characters.
Explicitly strip newlines using str.rstrip:
def load_words(base_dir):
    words = []
    for i in string.uppercase:
        location = base_dir + "\\" + i + " Words.txt"
        with open(location, "rb+") as f:
            for line in f:  # <---------
                words.append(line.rstrip())  # <---------
            # OR words.extend(line.rstrip() for line in f)
    return words
Try this. Hope it helps.
from io import *
import string
def load_words(base_dir):
    words = []
    for i in string.uppercase:
        location = base_dir + "\\" + i + " Words.txt"
        with open(location, "rb+") as f:
            for i in f.readlines():
                words.append(i.strip())
    return words
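As a side note (mine, not from the answers): string.uppercase only exists on Python 2; on Python 3 the equivalent is string.ascii_uppercase, and opening in text mode keeps the stripped results as str rather than bytes:
import string

def load_words(base_dir):
    words = []
    for letter in string.ascii_uppercase:  # Python 3 spelling
        location = base_dir + "\\" + letter + " Words.txt"
        with open(location) as f:  # text mode
            words.extend(line.rstrip() for line in f)
    return words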

Count lines matching different patterns in one pass

I have a Python script that, given a pattern, goes over a file and, for each line that matches the pattern, counts how many times that line shows up in the file.
The script is the following:
#!/usr/bin/env python
import time

fnamein = 'Log.txt'

def filter_and_count_matches(fnamein, fnameout, match):
    fin = open(fnamein, 'r')
    curr_matches = {}
    order_in_file = []  # need this because dict has no particular order
    for line in (l for l in fin if l.find(match) >= 0):
        line = line.strip()
        if line in curr_matches:
            curr_matches[line] += 1
        else:
            curr_matches[line] = 1
            order_in_file.append(line)
    #
    fout = open(fnameout, 'w')
    # for line in order_in_file:
    for line, _dummy in sorted(curr_matches.iteritems(),
                               key=lambda (k, v): (v, k), reverse=True):
        fout.write(line + '\n')
        fout.write(' = {}\n'.format(curr_matches[line]))
    fout.close()

def main():
    for idx, match in enumerate(open('staffs.txt', 'r').readlines()):
        curr_time = time.time()
        match = match.strip()
        fnameout = 'm{}.txt'.format(idx + 1)
        filter_and_count_matches(fnamein, fnameout, match)
        print 'Processed {}. Time = {}'.format(match, time.time() - curr_time)

main()
So right now I am going over the file once for each pattern I want to check.
Is it possible to do this by going over the file just once? (The file is quite big, so it takes a while to process.) It would be nice to be able to do this in an elegant, "easy" way. Thanks!
Looks like a Counter would do what you need:
from collections import Counter
lines = Counter([line for line in myfile if match_string in line])
For example, if myfile contains
123abc
abc456
789
123abc
abc456
and match_string is "abc", then the above code gives you
>>> lines
Counter({'123abc': 2, 'abc456': 2})
For multiple patterns, how about this:
patterns = ["abc", "123"]
# initialize one Counter for each pattern
results = {pattern: Counter() for pattern in patterns}
for line in myfile:
    for pattern in patterns:
        if pattern in line:
            results[pattern][line] += 1
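To tie this back to the original script, the per-pattern counts could then be written out after the single pass, e.g. with Counter's most_common() for descending-count order (a sketch; the m{}.txt names and ' = N' format mirror the question's code):
for idx, pattern in enumerate(patterns):
    with open('m{}.txt'.format(idx + 1), 'w') as fout:
        for line, count in results[pattern].most_common():
            fout.write(line.strip() + '\n')
            fout.write(' = {}\n'.format(count))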
