Parse fasta sequence to the dictionary

I need the most trivial solution to convert fasta.txt, which contains multiple nucleotide sequences like
>seq1
TAGATTCTGAGTTATCTCTTGCATTAGCAGGTCATCCTGGTCAAACCGCTACTGTTCCGG
CTTTCTGATAATTGATAGCATACGCTGCGAACCCACGGAAGGGGGTCGAGGACAGTGGTG
>seq2
TCCCTCTAGAGGCTCTTTACCGTGATGCTACATCTTACAGGTATTTCTGAGGCTCTTTCA
AACAGGTGCGCGTGAACAACAACCCACGGCAAACGAGTACAGTGTGTACGCCTGAGAGTA
>seq3
GGTTCCGCTCTAAGCCTCTAACTCCCGCACAGGGAAGAGATGTCGATTAACTTGCGCCCA
TAGAGCTCTGCGCGTGCGTCGAAGGCTCTTTTCGCGATATCTGTGTGGTCTCACTTTGGT
into a dictionary (name, value), where each name is a > header and each value is the corresponding sequence.
Below is my failed attempt to do it via two lists (it does not work for long sequences that span more than one line):
f = open('input2.txt', 'r')
list={}
names=[]
seq=[]
for line in f:
    if line.startswith('>'):
        names.append(line[1:-1])
    elif line.startswith('A') or line.startswith('C') or line.startswith('G') or line.startswith('T'):
        seq.append(line)
list = dict(zip(names, seq))
I'd be thankful for a solution showing how to fix it, and an example of how to do it in a separate function.
Thanks for the help,
Gleb

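It is better to use the Biopython library:
from Bio import SeqIO
with open("input.fasta") as input_file:
    my_dict = SeqIO.to_dict(SeqIO.parse(input_file, "fasta"))
Note that the values in my_dict are SeqRecord objects; use str(record.seq) to get the plain sequence string.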

A simple correction to your code:
from collections import defaultdict  # this will make your life simpler
f = open('input2.txt','r')
list=defaultdict(str)
name = ''
for line in f:
    # if your line starts with a > then it is the name of the following sequence
    if line.startswith('>'):
        name = line[1:].strip()
        continue  # this skips to the next line
    # This code is only executed if it is a sequence of bases and not a name.
    list[name]+=line.strip()
UPDATE:
Since I've got a notification that this old answer was upvoted, I've decided to present what I now think is the proper solution using Python 3.7. Translation to Python 2.7 only requires removing the typing import line and the function annotations:
from collections import OrderedDict
from typing import Dict
NAME_SYMBOL = '>'
def parse_sequences(filename: str,
                    ordered: bool=False) -> Dict[str, str]:
    """
    Parses a text file of genome sequences into a dictionary.
    Arguments:
      filename: str - The name of the file containing the genome info.
      ordered: bool - Set this to True if you want the result to be ordered.
    """
    result = OrderedDict() if ordered else {}
    last_name = None
    with open(filename) as sequences:
        for line in sequences:
            if line.startswith(NAME_SYMBOL):
                # strip() also copes with a last line that has no trailing newline
                last_name = line[1:].strip()
                result[last_name] = []
            else:
                result[last_name].append(line.strip())
    for name in result:
        result[name] = ''.join(result[name])
    return result
Now, I realize that the OP asked for the "most trivial solution", however since they are working with genome data, it seems fair to assume that each sequence could potentially be very large. In that case it makes sense to optimize a little bit by collecting the sequence lines into a list, and then to use the str.join method on those lists at the end to produce the final result.
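For trying the same idea without a file on disk, here is a minimal sketch of a variant that accepts any iterable of lines (an open file, a list of strings, and so on). The name parse_fasta_lines is hypothetical, and two behaviours are assumptions rather than anything the answers above specify: blank lines are skipped, and sequence text that appears before the first header is ignored.

```python
from typing import Dict, Iterable, List, Optional


def parse_fasta_lines(lines: Iterable[str]) -> Dict[str, str]:
    """Map each '>' header (without the '>') to its joined sequence."""
    chunks: Dict[str, List[str]] = {}
    current: Optional[str] = None  # header of the record being read
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # assumption: blank lines carry no data
        if line.startswith(">"):
            current = line[1:]
            chunks[current] = []
        elif current is not None:
            chunks[current].append(line)
        # assumption: text before the first header is silently ignored
    # join once per record instead of concatenating strings line by line
    return {name: "".join(parts) for name, parts in chunks.items()}


sample = [
    ">seq1\n",
    "TAGATTCT\n",
    "GAGTTATC\n",
    "\n",
    ">seq2\n",
    "TCCCTCTA",  # last line without a trailing newline
]
records = parse_fasta_lines(sample)
print(records)
```

With a real file, the call would look like `with open("input2.txt") as f: records = parse_fasta_lines(f)`.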

Related

multiple modification to a list at once

I have a text file of some IPs and MACs. The MACs are in the format xxxx.xxxx.xxxx, and I need to change them all to xx:xx:xx:xx:xx:xx.
I am already reading the file and putting it into a list. Now I am looping through each line of the list and need to make multiple modifications: remove the IPs and then change the MAC format.
The problem I am running into is that I can't figure out how to do this in one shot unless I copy the list to a new list for every modification.
How can I loop through the list once and update each element with all my modifications?
count = 0
output3 = []
for line in output:
    #print(line)
    #removes any extra spaces between words in a string.
    output[count] = (str(" ".join(line.split())))
    #create a new list with just the MAC addresses
    output3.append(str(output[count].split(" ")[3]))
    #create a new list with MAC's using a ":"
    count += 1
print(output3)

It appears you are trying to overthink the problem, so that may be where your frustration is spinning you around a bit.
First, you should always consider if you need a count variable in python. Usually you do not, and the enumerate() function is your friend here.
Second, there is no need to process data multiple times in python. You can use variables to your advantage and leverage python's expressiveness, rather than trying to hide your problem from the language.
Below is an implementation example that may help you think through your approach. Good luck on solving your harder problems, and I hope Python will help you out with them!
#! /usr/bin/env python3
import re
from typing import Iterable
# non-regex reformat mac to be xx:xx:xx:xx:xx:xx
# NOTE: this assumes a source with '.' separators only
# reformat_mac = lambda _: ':'.join(_ for _ in _.split('.') for _ in (_[:2], _[2:]))
# regex reformat mac to be xx:xx:xx:xx:xx:xx
# NOTE: Only requires at least two hex digits adjacent at a time
reformat_mac = lambda _: ":".join(re.findall(r"(?i)[\da-f]{2}", _))
def generate_output3(output: Iterable[str]) -> Iterable[str]:
    for line in output:
        col1, col2, col3, mac, *cols = line.split()
        mac = reformat_mac(mac)
        yield " ".join((col1, col2, col3, mac, *cols))
if __name__ == "__main__":
    output = [
        "abc def ghi 1122.3344.5566",
        "jklmn op qrst 11a2.33c4.55f6 uv wx yz",
        "zyxwu 123 next 11a2.33c4.55f6 uv wx yz",
    ]
    for line in generate_output3(output):
        print(line)

Solution
You can use the regex (regular expression) module to extract any pattern that matches that of the
mac-ids: "xxxx:xxxx:xxxx" and then process it to produce the expected output ("xx-xx-xx-xx-xx-xx")
as shown below.
Note: I have used a dummy data file (see section: Dummy Data below) to make this answer
reproducible. It should work with your data as well.
# import re
filepath = "input.txt"
content = read_file(filepath)
mac_ids = extract_mac_ids(content, format=True) # format=False --> "xxxx:xxxx:xxxx"
print(mac_ids)
## OUTPUT:
#
# ['a0-b1-ff-33-ac-d5',
# '11-b9-33-df-55-f6',
# 'a4-d1-e7-33-ff-55',
# '66-a1-b2-f3-b9-c5']
Code: Convenience Functions
How does the regex work? see this example
def read_file(filepath: str):
    """Reads and returns the content of a file."""
    with open(filepath, "r") as f:
        content = f.read() # read the whole file at once
    return content
def format_mac_id(mac_id: str):
    """Returns a formatted mac_id.
    INPUT FORMAT: "xxxxxxxxxxxx"
    OUTPUT FORMAT: "xx-xx-xx-xx-xx-xx"
    """
    mac_id = list(mac_id)
    mac_id = ''.join([ f"-{v}" if (i % 2 == 0) else v for i, v in enumerate(mac_id)])[1:]
    return mac_id
def extract_mac_ids(content: str, format: bool=True):
    """Extracts and returns a list of mac_ids, formatted if format=True.
    INPUT FORMAT: "xxxx:xxxx:xxxx"
    OUTPUT FORMAT: "xx-xx-xx-xx-xx-xx"
    """
    import re
    # pattern = "(" + ':'.join([r"\w{4}"]*3) + "|" + ':'.join([r"\w{2}"]*6) + ")"
    # pattern = r"(\w{4}:\w{4}:\w{4}|\w{2}:\w{2}:\w{2}:\w{2}:\w{2}:\w{2})"
    pattern = r"(\w{4}:\w{4}:\w{4})"
    pat = re.compile(pattern)
    mac_ids = pat.findall(content) # returns a list of all mac-ids
    # Replaces the ":" with "" and then formats
    # each mac-id as: "xx-xx-xx-xx-xx-xx"
    if format:
        mac_ids = [format_mac_id(mac_id.replace(":", "")) for mac_id in mac_ids]
    return mac_ids
Dummy Data
The following code block creates a dummy file with some sample mac-ids.
filepath = "input.txt"
s = """
a0b1:ff33:acd5 ghwvauguvwi ybvakvi
klasilvavh; 11b9:33df:55f6
haliviv
a4d1:e733:ff55
66a1:b2f3:b9c5
"""
# Create dummy data file
with open(filepath, "w") as f:
f.write(s)
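The Solution above matches colon-separated IDs and emits dashes, while the question's data uses dots (xxxx.xxxx.xxxx) and asks for colons. A minimal sketch adapted to that format could look like the following. The helper names and the sample lines are made up for illustration (the question does not show its real input), and the pattern assumes every MAC is written as exactly three groups of four hex digits.

```python
import re
from typing import Iterable, List

# Three dot-separated groups of four hex digits, e.g. "11a2.33c4.55f6".
# An IPv4 address such as 10.0.0.1 cannot match: its groups have at most 3 digits.
DOTTED_MAC = re.compile(r"\b[0-9A-Fa-f]{4}(?:\.[0-9A-Fa-f]{4}){2}\b")


def to_colon_format(mac: str) -> str:
    """Turn '11a2.33c4.55f6' into '11:a2:33:c4:55:f6'."""
    digits = mac.replace(".", "")
    return ":".join(digits[i:i + 2] for i in range(0, len(digits), 2))


def macs_from_lines(lines: Iterable[str]) -> List[str]:
    """Single pass: find every dotted MAC and return it colon-separated."""
    return [
        to_colon_format(match.group(0))
        for line in lines
        for match in DOTTED_MAC.finditer(line)
    ]


# Made-up sample lines; the real file layout is an assumption.
sample_lines = [
    "Internet  10.0.0.1   5   1122.3344.5566  ARPA",
    "Internet  10.0.0.27  12  11a2.33c4.55f6  ARPA",
]
print(macs_from_lines(sample_lines))
```

Because the MACs are located by pattern rather than by column position, the IPs never need to be removed in a separate step.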

In Python, how to match a string to a dictionary item (like 'Bra*')

I'm a complete novice at Python, so please excuse me for asking something stupid.
A dictionary is built from a text file to be used as a pass/block filter.
The text file contains addresses, each followed by either block or allow, like "002029568,allow" or "0011*,allow" (without the quotes).
The search input is a string with a complete code like "001180000".
How can I evaluate whether the search item is in the dictionary and make it match the "0011*,allow" line?
Thank you very much for your effort!
The filter dictionary is built with:
def loadFilterDict(filename):
    global filterDict
    try:
        with open(filename, "r") as text_file:
            lines = text_file.readlines()
            for s in lines:
                fields = s.split(',')
                if len(fields) == 2:
                    filterDict[fields[0]] = fields[1].strip()
            text_file.close()
    except:
        pass
Check if the code (ccode) is in the dictionary:
if ccode in filterDict:
    if filterDict[ccode] in ['block']:
        continue
else:
    if filterstat in ['block']:
        continue
The filters-file is like:
002029568,allow
000923993,allow
0011*, allow

If you can use re, you don't have to worry about the wildcard but let re.match do the hard work for you:
# Rules input (this could also be read from file)
lines = """002029568,allow
0011*,allow
001180001,block
"""
# Parse rules from string
rules = []
for line in lines.split("\n"):
    line = line.strip()
    if not line:
        continue
    identifier, ruling = line.split(",")
    rules += [(identifier, ruling)]
# Get rulings for specific number
def rule(number):
    from re import match
    rulings = []
    for identifier, ruling in rules:
        # Replace wildcard with regex .*
        identifier = identifier.replace("*", ".*")
        if match(identifier, number):
            rulings += [ruling]
    return rulings
print(rule("001180000"))
print(rule("001180001"))
Which prints:
['allow']
['allow', 'block']
The function will return a list of rulings. Their order is the same order as they appear in your config lines. So you could easily just pick the last or first ruling whichever is the one you're interested in.
Or break the loop prematurely if you can assume that no two rulings will interfere.
Examples:
001180000 is matched by 0011*,allow only, so the only ruling which applies is allow.
001180001 is matched by 0011*,allow at first, so you'll get allow as before. However, it is also matched by 001180001,block, so a block will get added to the rulings, too.
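One detail to keep in mind with re.match is that it only anchors at the start of the string, so an exact entry such as 002029568 would also match a longer code that merely begins with those digits. As a hedged alternative, the standard-library fnmatch module understands the * wildcard directly and always matches the whole string. The sketch below uses hypothetical helper names (parse_rules, rulings_for); note that fnmatch also treats ? and [...] as special characters, which is assumed not to matter for purely numeric codes.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Tuple

Rule = Tuple[str, str]  # (pattern, ruling)


def parse_rules(lines: Iterable[str]) -> List[Rule]:
    """Parse 'pattern,ruling' lines, tolerating spaces around the comma."""
    rules = []
    for line in lines:
        fields = [field.strip() for field in line.split(",")]
        if len(fields) == 2 and fields[0]:
            rules.append((fields[0], fields[1]))
    return rules


def rulings_for(code: str, rules: List[Rule]) -> List[str]:
    """Return every ruling whose pattern matches the whole code, in file order."""
    return [ruling for pattern, ruling in rules if fnmatchcase(code, pattern)]


rules = parse_rules(["002029568,allow", "0011*, allow", "001180001,block", ""])
print(rulings_for("001180000", rules))
print(rulings_for("001180001", rules))
print(rulings_for("0020295680", rules))  # longer than the exact entry
```

As with the answer above, the caller still decides which ruling wins when several apply, for example the first or the last one in the list.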

If the wildcard entries in the file have a fixed length (for example, you only need to support lines like 0011*,allow and not 00110*,allow or 0*,allow or any other arbitrary number of digits followed by *) you can use a nested dictionary, where the outer keys are the known parts of the wildcarded entries.
d = {'0011': {'001180000': 'value', '001180001': 'value'}}
Then when you parse the file and get to the line 0011*,allow you do not need to do any matching. All you have to do is check if '0011' is present. Crude example:
d = {'0011': {'001180000': 'value', '001180001': 'value'}}
line = '0011*,allow'
prefix = line.split(',')[0][:-1]
if prefix in d:
    # there is a "match", then you can deal with all the entries that match,
    # in this case the items in the inner dictionary
    # {'001180000': 'value', '001180001': 'value'}
    print('match')
else:
    print('no match')
If you do need to support arbitrary lengths of wildcarded entries, you will have to resort to a loop iterating over the dictionary (thereby defeating the purpose of using a dictionary to begin with):
d = {'001180000': 'value', '001180001': 'value'}
line = '0011*,allow'
prefix = line.split(',')[0][:-1]
for k, v in d.items():
    if k.startswith(prefix):
        # found matching key-value pair
        print(k, v)
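For the lookup direction described in the question (a complete code checked against stored filter entries), one way to keep dictionary lookups even with wildcards of arbitrary length is to store the wildcard entries as keys and probe progressively shorter prefixes of the code. This is only a sketch: the function name lookup is hypothetical, and the policy that an exact entry beats a wildcard and the longest prefix wins is an assumption, not something the question specifies.

```python
from typing import Dict, Optional


def lookup(code: str, filters: Dict[str, str]) -> Optional[str]:
    """Return the ruling for code, or None if no entry applies.

    Assumed policy: an exact entry wins, otherwise the longest 'prefix*' entry.
    """
    if code in filters:
        return filters[code]
    # Try "001180000*", "00118000*", ..., "0*", "*": at most len(code) + 1 probes.
    for length in range(len(code), -1, -1):
        key = code[:length] + "*"
        if key in filters:
            return filters[key]
    return None


filters = {"002029568": "allow", "0011*": "allow", "00118*": "block"}
print(lookup("002029568", filters))  # exact entry
print(lookup("001180000", filters))  # both wildcards apply, "00118*" is longer
print(lookup("001100000", filters))  # only "0011*" applies
print(lookup("999", filters))        # nothing applies
```

The number of probes depends on the length of the code rather than on the number of filter entries, which preserves the benefit of using a dictionary.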

Counting keyword in the file

I am trying to count the keywords in a .py file but the code I wrote is also counting keywords which occur in strings.
How can I differentiate between actual keywords and the ones in strings? For example: is, with and in are keywords, but you can also spot those in comments and user input strings. This is what I have tried:
from collections import Counter
import keyword
count = {}
scode = input("Enter the name of Python source code file name :")
with open(scode,'r') as f:
    for line in f:
        words = line.split()
        for i in words:
            if(keyword.iskeyword(i)):
                count[i]= count.get(i,0)+1
print(count)

You can use ast.parse to parse the code, create a ast.NodeTransformer subclass to clear all the string nodes (no need to clear comments because comments are automatically ignored by ast.parse already), install the astunparse package to turn the node back to source code, and then count the keywords:
import ast
import astunparse
import keyword
import re
class clear_strings(ast.NodeTransformer):
    def visit_Str(self, node):
        node.s = ''
        return node
n = ast.parse('''
a = 'True'
assert False
# [[] for _ in range(9)]
"""if"""
''')
clear_strings().visit(n)
print(sum(map(keyword.iskeyword, re.findall(r'\w+', astunparse.unparse(n)))))
This outputs: 2 (because only assert and False are counted as keywords)
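The same separation of keywords from strings and comments can also be sketched with only the standard library. The tokenize module labels string literals and comments with their own token types, so a keyword can only show up as a NAME token. This avoids the third-party astunparse package and the visit_Str hook, which has been deprecated since Python 3.8. The function name count_keywords is hypothetical, and the sample source is the one used in the answer above.

```python
import io
import keyword
import tokenize
from collections import Counter


def count_keywords(source: str) -> Counter:
    """Count keyword tokens in Python source, ignoring strings and comments."""
    counts = Counter()
    readline = io.StringIO(source).readline
    for token in tokenize.generate_tokens(readline):
        # Keywords are reported as NAME tokens; STRING and COMMENT tokens are skipped.
        if token.type == tokenize.NAME and keyword.iskeyword(token.string):
            counts[token.string] += 1
    return counts


sample = '''
a = 'True'
assert False
# [[] for _ in range(9)]
"""if"""
'''
counts = count_keywords(sample)
print(counts)
print(sum(counts.values()))
```

To count keywords in a file, read its text first, for example `count_keywords(open(path).read())`.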

Can't get unique word/phrase counter to work - Python

I'm having trouble getting anything to write to my output file (word_count.txt).
I expect the script to review all 500 phrases in my phrases.txt document, and output a list of all the words and how many times they appear.
from re import findall,sub
from os import listdir
from collections import Counter
# path to folder containg all the files
str_dir_folder = '../data'
# name and location of output file
str_output_file = '../data/word_count.txt'
# the list where all the words will be placed
list_file_data = '../data/phrases.txt'
# loop through all the files in the directory
for str_each_file in listdir(str_dir_folder):
if str_each_file.endswith('data'):
# open file and read
with open(str_dir_folder+str_each_file,'r') as file_r_data:
str_file_data = file_r_data.read()
# add data to list
list_file_data.append(str_file_data)
# clean all the data so that we don't have all the nasty bits in it
str_full_data = ' '.join(list_file_data)
str_clean1 = sub('t','',str_full_data)
str_clean_data = sub('n',' ',str_clean1)
# find all the words and put them into a list
list_all_words = findall('w+',str_clean_data)
# dictionary with all the times a word has been used
dict_word_count = Counter(list_all_words)
# put data in a list, ready for output file
list_output_data = []
for str_each_item in dict_word_count:
str_word = str_each_item
int_freq = dict_word_count[str_each_item]
str_out_line = '"%s",%d' % (str_word,int_freq)
# populates output list
list_output_data.append(str_out_line)
# create output file, write data, close it
file_w_output = open(str_output_file,'w')
file_w_output.write('n'.join(list_output_data))
file_w_output.close()
Any help would be great (especially if I'm able to actually output 'single' words within the output list).
Thanks very much.

It would be helpful to have more information, such as what you've tried and what error messages you received. As kaveh commented above, this code has some major indentation issues. Once I got around those, there were a number of other logic errors to work through. I've made some assumptions and changes:
list_file_data is assigned to '../data/phrases.txt', but there is then a loop through all files in a directory. Since you don't have any handling for multiple files elsewhere, I've removed that logic and referenced the file listed in list_file_data (and added a small bit of error handling). If you do want to walk through a directory, I'd suggest using os.walk() (http://www.tutorialspoint.com/python/os_walk.htm).
You named your file 'phrases.txt' but then check whether the file names end with 'data'. I've removed this logic.
You've placed the data set into a list, when findall works just fine with strings and ignores the special characters that you've manually removed. Test here: https://regex101.com/ to make sure.
Changed 'w+' to '\w+' (check out the above link).
Converting to a list outside of the output loop isn't necessary: your dict_word_count is a Counter object, which has an items() method (iteritems() in Python 2) to roll through each key and value. I also changed the variable name to 'counter_word_count' to be slightly more accurate.
Instead of manually generating CSVs, I've imported csv and used the writerow method (and quoting options).
Code below, hope this helps:
import csv
import os
from collections import Counter
from re import findall
# name and location of output file
str_output_file = '../data/word_count.txt'
# path of the input file containing the phrases
list_file_data = '../data/phrases.txt'
if not os.path.exists(list_file_data):
    raise OSError('File {} does not exist.'.format(list_file_data))
with open(list_file_data, 'r') as file_r_data:
    str_file_data = file_r_data.read()
# find all the words and put them into a list
list_all_words = findall(r'\w+',str_file_data)
# Counter with the number of times each word has been used
counter_word_count = Counter(list_all_words)
with open(str_output_file, 'w') as output_file:
    fieldnames = ['word', 'freq']
    writer = csv.writer(output_file, quoting=csv.QUOTE_ALL)
    writer.writerow(fieldnames)
    for key, value in counter_word_count.items():
        output_row = [key, value]
        writer.writerow(output_row)

Something like this?
from collections import Counter
from glob import glob
def extract_words_from_line(s):
    # make this as complicated as you want for extracting words from a line
    return s.strip().split()
tally = sum(
    (Counter(extract_words_from_line(line))
     for infile in glob('../data/*.data')
     for line in open(infile)),
    Counter())
for k in sorted(tally, key=tally.get, reverse=True):
    print(k, tally[k])
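Summing Counters builds a fresh Counter for every line. A minimal sketch of the same tally that updates a single Counter in place and lets most_common() do the sorting is shown below. The function name count_words and the sample phrases are made up for illustration, and lower-casing the text before counting is an assumption that may or may not suit the original data.

```python
import re
from collections import Counter
from typing import Iterable


def count_words(lines: Iterable[str]) -> Counter:
    """Tally words across all lines using one Counter updated in place."""
    tally = Counter()
    for line in lines:
        # assumption: "The" and "the" should count as the same word
        tally.update(re.findall(r"\w+", line.lower()))
    return tally


# Made-up phrases standing in for the contents of phrases.txt
phrases = ["the quick brown fox", "The lazy dog", "the end"]
tally = count_words(phrases)
for word, freq in tally.most_common(3):
    print(word, freq)
```

An open file can be passed directly, since iterating over it yields lines: `with open('../data/phrases.txt') as f: tally = count_words(f)`.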

find key in dictionary and print value

Hello everyone, I am stuck on a class assignment and not sure where to go at this point. My college does not offer tutors for the programming field, as this is the first semester that it has been offered. The assignment is:
Write a program that:
Prints out the toy name for that code in a useful message such as, ‘The toy for that code is a Baseball’
The program exits when instead of a toy code, the user enters ‘quit’
Below is a sample of the text file that the dict is supposed to be populated from:
D1,Tyrannasaurous
D2,Apatasauros
D3,Velociraptor
D4,Tricerotops
D5,Pterodactyl
T1,Diesel-Electric
T2,Steam Engine
T3,Box Car
and what I have gotten so far is:
fin=open('C:/Python34/Lib/toys.txt','r')
print(fin)
toylookup=dict() #creates a dictionary named toy lookup
def get_line(): #get a single line from the file
    newline=fin.readline() #get the line
    newline=newline.strip() #strip away extra characters
    return newline
print ('please enter toy code here>>>')
search_toy_code= input()
for toy_code in toylookup.keys():
    if toy_code == search_toy_code:
        print('The toy for that code is a','value')
    else:
        print("toy code not found")
To be honest, I am not even sure I am right with what I have. Any help at all would be greatly appreciated. Thank you.

There are two issues.
Your dictionary isn't getting populated; however there currently isn't enough info in your question to help with that problem. Need to know what the file looks like, etc.
Your lookup loop won't display the values for keys that match. Below is the solution for that.
Try iterating over key:value pairs like this:
for code, toy in toylookup.items():
    if code == search_toy_code:
        print('The toy for that code ({}) is a {}'.format(code, toy))
    else:
        print("Toy code ({}) not found".format(code))
Take a look at the docs for dict.items():
items():
Return a new view of the dictionary’s items ((key, value) pairs).
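Because a dict maps each code straight to its toy name, the search itself does not need a loop: the in operator and indexing find the entry in one step, and a loop is only needed to keep prompting until the user types quit, as the assignment requires. The following is a minimal sketch under those assumptions. The names load_toys and lookup_loop are hypothetical, and the read and write parameters exist only so the loop can be exercised without a keyboard.

```python
from typing import Callable, Dict, Iterable


def load_toys(lines: Iterable[str]) -> Dict[str, str]:
    """Build {code: name} from lines such as 'T2,Steam Engine'."""
    toys = {}
    for line in lines:
        code, sep, name = line.strip().partition(",")
        if sep:  # skip blank lines and lines without a comma
            toys[code] = name
    return toys


def lookup_loop(
    toys: Dict[str, str],
    read: Callable[[str], str] = input,
    write: Callable[[str], None] = print,
) -> None:
    """Ask for toy codes until the user enters 'quit'."""
    while True:
        code = read("Enter a toy code (or 'quit'): ").strip()
        if code == "quit":
            break
        if code in toys:
            write("The toy for that code is a {}".format(toys[code]))
        else:
            write("Toy code {} not found".format(code))


# A few lines copied from the sample file in the question
toys = load_toys(["D1,Tyrannasaurous\n", "T2,Steam Engine\n", "T3,Box Car\n"])
print(toys)
```

With the real file, the program would be started with something like `with open('toys.txt') as f: lookup_loop(load_toys(f))`.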

You should make yourself familiar with basic python programming. In order to solve such tasks you need to know about basic data structures and loops.
# define path and name of file
filepath = "test.txt" # content like: B1,Baseball B2,Basketball B3,Football
# read file data
with open(filepath) as f:
    fdata = f.readlines() # reads every line in fdata
# fdata is now a list containing each line
# prompt the user
print("please enter toy code here: ")
user_toy_code = input()
# dict container
toys_dict = {}
# get the items with toy codes and names
for line in fdata: # iterate over every line
    line = line.strip() # remove extra whitespaces and stuff
    line = line.split(" ") # splits "B1,Baseball B2,Basketball"
    # to ["B1,Baseball", "B2,Basketball"]
    for item in line: # iterate over the items of a line
        item = item.split(",") # splits "B1,Baseball"
        # to ["B1", "Baseball"]
        toys_dict[item[0]] = item[1] # saves {"B1": "Baseball"} to the dict
# check if the user toy code is in our dict
if user_toy_code in toys_dict:
    print("The toy for toy code {} is: {}".format(user_toy_code, toys_dict[user_toy_code]))
else:
    print("Toy code {} not found".format(user_toy_code))
